Preface



This volume contains papers selected for presentation at the 6th IAPR Workshop
on Document Analysis Systems (DAS 2004) held during September 8-10, 2004 
at the University of Florence, Italy. Several papers represent the state of the art 
in a broad range of “traditional” topics such as layout analysis, applications to 
graphics recognition, and handwritten documents. Other contributions address 
the description of complete working systems, which is one of the strengths of 
this workshop. Some papers extend the application domains to other media, like 
the processing of Internet documents. 

A distinctive feature of this 6th workshop was the large number of papers related
to digital libraries and to the processing of historical documents, a task which
frequently requires the analysis of color documents. A total of 17 papers are
associated with these topics, whereas two years ago (in DAS 2002) only a couple 
of papers dealt with these problems. 

In our view there are three main reasons for this new wave in the DAS 
community. From the scientific point of view, several research fields reached a 
thorough knowledge of techniques and problems that can be effectively solved, 
and this expertise can now be applied to new domains. Another incentive has 
been provided by several research projects funded by the EC and the NSF on 
topics related to digital libraries. Last but not least, the organization of focused 
events, like the recent DIAL workshop chaired by Henry Baird and Venu Govindaraju
in Palo Alto (CA), had a strong impact on the definition of new research
directions. It is also a lucky coincidence that this new trend in DAS
research emerged in this edition, organized in a town such as Florence, which
preserves such an exceptional artistic and cultural heritage.

We received a total of 79 submissions from 19 countries, and we selected 
31 oral presentations and 22 posters highlighted with short oral introductions. 
As a supplement to these proceedings, notes from the workshop discussions and
other material related to presented papers will be posted on the DAS 2004 website:
http://www.dsi.unifi.it/DAS04. Each paper was reviewed by three reviewers,
whom we would like to warmly thank here. We should mention the valuable
support and hints provided by members of the Program Committee and past 
DAS chairs. We also wish to acknowledge the generosity of our sponsors: the 
International Association for Pattern Recognition, the University of Florence, 
the DFKI, ABBYY, Hitachi, and Siemens. 

Special thanks are due to Alessio Ceroni, Cristina Dolfi, and Emanuele Marino 
for their invaluable contributions to the local organization. 

Abstract. Implications of technical demands made within digital libraries
(DL’s) for document image analysis systems are discussed. The
state-of-the-art is summarized, including a digest of themes that emerged 
during the recent International Workshop on Document Image Analysis 
for Libraries. We attempt to specify, in considerable detail, the essential 
features of document analysis systems that can assist in: (a) the creation 
of DL’s; (b) automatic indexing and retrieval of doc-images within DL’s; 
(c) the presentation of doc-images to DL users; (d) navigation within 
and among doc-images in DL’s; and (e) effective use of personal and 
interactive DL’s. 



1 Introduction 

Within digital libraries (DL’s), imaged paper documents are growing in number 
and importance, but they are too often unable to play many of the useful roles 
that symbolically encoded (“born digital”) documents do. Traditional document 
image analysis (DIA) systems can relieve some, but not all, of these obstacles. 
In particular, the unusually wide variety of document images found in DL’s, 
representing many languages, historical periods, and scanning regimes, taken 
together pose an almost insuperable problem for present-day DIA systems. How 
should DIA systems be redesigned to assist in the solution of a far broader range 
of DIA problems than have ever been attempted before? 

Section 2 summarizes the principal points relevant to this question that were 
aired at the International Workshop on Document Image Analysis for Libraries 
(DIAL2004). The issue of hardcopy books versus digital displays is raised in Section 3.
Section 4 considers problems associated with document-image capture,
legibility, completeness checking, support for scholarly study, and archival con- 
servation. Certain problems arising in early-stage image processing may require 
fresh DIA solutions, as described in Section 5. Section 6 points out implications 
for DIA systems of the lack of fully automatic, high-accuracy methods for ana- 
lyzing doc-image content. Needs for improved methods for presentation, display, 
printing, and reflowing of document images are discussed in Section 7. Retrieval, 
indexing, and summarization of doc-images are addressed in Section 8. Finally,
Section 9 lists some problems arising in “personal” and interactive digital li- 
braries, followed by brief conclusions in Section 10. 

2 The DIAL2004 Workshop

The First International Workshop on Document Image Analysis for Libraries 
(January 23-24, 2004, Palo Alto, CA) brought together fifty-five researchers,
practitioners, business people, and end-users who were all interested
in new technologies assisting the integration of imaged documents within DL’s 
so that, ideally, everything that can be done with “born digital” data can also 
be done with scanned hardcopy documents. Academia, industry, and govern- 
ment in twelve countries were represented by researchers from the document 
image analysis, digital libraries, library science, information retrieval, data min- 
ing, and humanities fields. The participants worked together, in panels, debates, 
and group discussions, to describe the state of the art and identify urgent open 
problems. More broadly, the workshop attempted to stimulate closer cooperation 
in the future between the DIA and DL communities. 

Twenty-nine regular papers, published in the proceedings [7], established the 
framework for discussion, which embraced six broad topics: 

— DIA challenges in historical DL collections;

— handwriting recognition for DL’s; 

— multilingual DL’s; 

— DL systems architectures and costs; 

— retrieval in DL’s using DIA methods; and 

— content extraction from document images for DL’s. 

The remainder of this paper summarizes work relating to these topics, with the 
current section placing special emphasis on the first three areas. 

2.1 DIA Challenges in Historical DL Collections 

Image Acquisition. Image capture from historical artifacts needs special han- 
dling to counter the defects of document aging and the physical constraints of 
digitization. A DIA oriented approach is suggested to effectively increase reso- 
lution and digitization speed, as well as to ensure document preservation during 
scanning and quality control [6,35]. 

Bourgeois et al. [35] use Signal to Noise Ratio (SNR) and other measures
to demonstrate the loss of resolution/data in image compression formats, and 
recommend storage in 256 gray levels or true color. They observe that curators 
should be informed about the needs of DL technology and drawbacks of lossy 
file formats like JPEG. In addition, non-UV cold lights and automatic page 
turners are used to safeguard originals during scanning, and errors are countered 
by using skew, lighting and curvature correction for book bindings and color 
depth reduction for medieval documents. Character reconstruction is suggested 
to restore broken characters in ancient documents. 
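
As a rough illustration of how an SNR figure can quantify the loss introduced by a lossy representation, the sketch below compares an original pixel sequence with a degraded copy. It is not the measurement procedure of [35]; the function name `snr_db`, the sample gray values, and the coarse quantization standing in for a lossy format are all assumptions made for this example.

```python
import math

def snr_db(original, degraded):
    """Signal-to-noise ratio in decibels between two equal-length pixel sequences."""
    if len(original) != len(degraded):
        raise ValueError("images must have the same number of pixels")
    signal = sum(p * p for p in original)
    noise = sum((p - q) ** 2 for p, q in zip(original, degraded))
    if noise == 0:
        return math.inf   # identical images: nothing was lost
    if signal == 0:
        return -math.inf  # an all-black original carries no signal energy
    return 10.0 * math.log10(signal / noise)

# Hypothetical 8-bit gray values; "lossy" simulates quantization to four gray levels.
original = [12, 40, 90, 130, 200, 250]
lossy = [(p // 64) * 64 + 32 for p in original]
print(f"SNR after quantization: {snr_db(original, lossy):.2f} dB")
```

A lower value indicates that more of the original signal was discarded, which is the kind of evidence a curator could weigh when choosing a storage format.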

Continuous scanning followed by automatic frame cropping is an efficient
and fast procedure for generating images from microfilm [9]. The Fourier-Mellin
transform is used to correct rotation/shear, scale, and translation errors [28].
Morphological operations, analysis of lightness and saturation in HLS (Hue,
Lightness, and Saturation) image data, and connected component analysis are used
to remove reconstructed paper areas [5].

Layout Analysis and Meta-data Extraction. Layout analysis and meta-data
extraction are crucial steps in creating an information base for historical
DL’s. Even as researchers are gaining ground on complete recognition of text 
content from historical documents (Subsection 2.2), practical systems have been 
built using only the layout analysis stage of DIA [9, 26, 35]. 

Availability of images makes it possible to provide content-based image retrieval,
using even structural features like color and layout. Marinai et al. [39]
create an MXY tree structure during document segmentation and then use lay- 
out similarity as a feature to query documents by example. 

A historical DL should supplement content with meta-data describing textual 
features (e.g., date, author, place) and geometrical information (e.g., paragraph
locations, image zones). Couasnon et al. use an automated Web-based system for 
collecting annotations of French archives [18]. The system combines automatic 
layout analysis with human-assisted annotation in a Web interface. 

Transcription of historical documents maps ASCII text to corresponding 
words in the document image. This is intended to circumvent the lack of perfect 
Optical Character Recognition (OCR) for ancient writing styles [23,33,66]. 



2.2 Handwriting Recognition for DL’s 

Although commercial products are available for typeset text, handwriting recog- 
nition has achieved success only in specialized domains. HMM-based character 
model recognizers are used in postal address recognition from mail-piece images 
[51,57]. These systems rely on context information related to addresses.

For transcript creation from historical documents, mapping systems use hand- 
writing recognition. OCR engines used in these applications cannot meet real- 
time recognition requests. Automatic author classification systems [65] use multi- 
stage binarization followed by identification of document writers using character 
features. For Hanja scripts, OCR and UI techniques [31] incorporate nonlinear 
shape normalization, contour direction features and recognizers based on Maha- 
lanobis distance to generate transcripts for Hanja (Korean) documents. 

An HMM-based recognizer for large lexicons is examined for indexing historical
documents in [23]. The system uses substring sharing, where a prefix tree is built 
from the lexicon. Entries that share the same prefix also share its computation 
without invoking the recognizer. Duration constraints on character states, choice 
pruning, and parallel decoding provide a speedup of 7.7 times. 
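
The substring-sharing idea can be sketched with a plain prefix tree. The code below is not the recognizer of [23]; it only counts how many per-character evaluations a lexicon needs with and without prefix sharing, under the simplifying assumption that each tree node costs one character-model evaluation. The class and function names and the sample lexicon are hypothetical.

```python
class TrieNode:
    """One node per distinct prefix character; children are keyed by character."""
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_prefix_tree(lexicon):
    """Insert every lexicon entry so that common prefixes share nodes."""
    root = TrieNode()
    for word in lexicon:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def count_nodes(node):
    """Number of nodes below `node`, i.e. character evaluations with sharing."""
    return sum(1 + count_nodes(child) for child in node.children.values())

lexicon = ["wash", "washing", "washington", "water", "waters"]
root = build_prefix_tree(lexicon)
flat_cost = sum(len(word) for word in lexicon)   # every entry scored independently
shared_cost = count_nodes(root)                  # shared prefixes scored once
print(flat_cost, shared_cost)
```

On this toy lexicon the shared tree needs fewer than half of the evaluations of the flat list; the saving on a real lexicon depends on how much its entries overlap.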

Zhang et al. [66] combine word model recognition and transcript mapping to 
create handwritten databases. Lavrenko et al. [34] suggest a holistic recognition 
technique wherein normalized word images are used as inputs to an HMM. Scalar
and profile features are extracted from the images and an entire historical document
is modeled as an HMM, with words constituting the state sequence. For a
document written by a given author, state transition probabilities are obtained
by averaging word bi-gram probabilities collected from contemporary texts and
previously transcribed writings of the target author.



2.3 Multilingual DL’s 

Despite excellent advances in Latin script DL’s, research in other scripts such 
as Indic (Arabic, Bengali, Devanagari, and Telugu), Chinese, Korean, etc. is
only recently receiving attention. Digital access to documents in these scripts is 
challenging by way of user interface (UI) design, layout analysis, and OCR. 

A multilingual DL system should support simultaneous storage, entry, and 
display of data in many scripts. Many non-Latin scripts have a complicated 
character set and need a separate encoding system [17]. The display and entry of 
these languages requires new fonts [40,47] and character input schemes. Also, to 
ensure compatibility and platform independence of data, a DL should not resort 
to customized solutions without completely examining existing standards. 

In terms of character encoding, the Unicode Consortium aims at providing 
a reliable encoding scheme for all scripts in the world [17]. It currently supports 
all commercial scripts and is accepted as a system standard by many DL re- 
searchers and software manufacturers [11,32,36,40,60,63]. Although alternate 
schemes have been suggested [43], they do not have the compatibility and global
acceptance of Unicode. On the storage front, XML is emerging as a versatile and 
preferred scheme for DL projects [3,32,53,63]. 

Turning to input and display techniques, multi-layered input schemes for 
phonetic scripts [52] are suggested for stylus/keypad based entry systems (e.g.,
for PDA’s). Keyboard mapping systems (INSCRIPT for Indic scripts) map the
keys of a standard QWERTY keyboard onto the characters of a target script [43].
This keyboard system is functional, but has a steep learning curve. Moreover, 
every keyboard has to be physically labeled before a user can associate the keys 
with relevant characters. TrueViz [36] uses a graphical keyboard for Russian 
script input. Kompalli et al. [32] use a transliteration scheme, where Devanagari 
characters are entered by phonetic equivalent strings in English. For example,
the Devanagari character क is entered using the English equivalent ka. A GUI
keyboard is also provided to enter special characters. 
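
A transliteration-based input scheme of this kind can be sketched as a greedy longest-match lookup. The mapping table below is a tiny hypothetical sample chosen for illustration, not the table used by Kompalli et al. [32], and the sketch ignores vowel signs and conjuncts, which a real Devanagari input method must handle.

```python
# Hypothetical, deliberately tiny table: Latin key sequence -> Devanagari character.
TABLE = {"ka": "क", "kha": "ख", "ga": "ग", "ma": "म", "la": "ल"}

def transliterate(text, table=TABLE):
    """Replace the longest matching Latin sequence at each position."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that "kha" wins over "ka" + "h...".
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            out.append(text[i])  # no mapping: pass the character through unchanged
            i += 1
    return "".join(out)

print(transliterate("ka"), transliterate("kamala"))
```

Matching the longest key first is what lets multi-letter sequences such as "kha" coexist with shorter ones in the same table.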

The ability to display multiple languages on a single interface is dependent 
on the encoding schema and fonts used in a DL system. Most designers of mul- 
tilingual software resort to Unicode-based fonts, and software vendors provide 
detailed guidelines for internationalization [24]. 



2.4 Multilingual Layout Analysis 

Variation in the writing order of scripts, and the presence of language-specific 
constructs such as shirorekha (Devanagari), modifiers (Arabic and Devanagari), 
or non-regular word spacing (Arabic and Chinese) require different approaches to 
layout analysis. For instance, gaps may not be used to identify words in Chinese 
and Arabic. Techniques for script identification vary from identifying scripts of 
individual words in a multilingual document [42] to those that determine scripts
of lines [44] and entire text blocks [27,62]. Once a script is identified, script-specific
line and word separation algorithms can be used [22].



2.5 Multilingual OCR 

Creation of data sets [30, 32] is a welcome development in providing training 
and testing resources for non-Latin script OCR. Providing data sets for certain 
scripts is a non-trivial task due to their large character sets and the variety of 
recognition units used by researchers [8, 13, 14, 38]. Some suggest splitting ground
truth into components to provide truth at multiple levels of granularity [22].

Common methods for Indic script OCR use structural features to build decision
trees [13, 14] or combine multiple knowledge bases to create statistical
classifiers [8,38]. Govindaraju et al. [22] combine structural and statistical fea- 
tures in a hybrid recognizer. Character images are pre-classified into categories 
based on structural features. A three layer Neural Network or a Nearest Neighbor 
classifier is then used to recognize the images. 

Partial character matching is used for Chinese OCR [64]. When a character
is presented to the recognizer, radicals or parts of characters are first identified. 
Classification of a sufficiently large number of components leads to recognition 
of the whole character. Tai et al. [61] use a multilayer perceptron network to 
divide Chinese characters into four layers. Classification at the lowest levels is 
followed by logical reconstruction to recognize characters. 

Holistic techniques are being used for off-line and online recognition of Arabic 
[1,4]. Pseudo-2D HMM’s are used for ligature modeling in online recognition of
Hangul scripts [54]. Bazzi et al. [10] recognize Arabic and English, using word- 
based HMM’s with trigram character probabilities to improve recognition rates. 

3 Ink-on-Paper Versus Digital Displays 

Many physical properties of ink-on-paper assist human reading [50]: it is lightweight,
thin, flexible, markable, unpowered (and so “always-on”), stable, and
cheap. Of course, digital display devices used to access today’s DL’s - desktop, 
laptop, and handheld computers, plus eBook readers, tablet PC’s, etc. - have 
many advantages, too: they are automatically and rapidly rewritable, interactive, 
and connected (e.g., wirelessly) via networks to vast databases. However, there
remain many ways in which information conveyed originally as ink-on-paper may 
not be better delivered by digital means: these need to be better elucidated (for 
an extended discussion, see [19]). 

It is by no means certain that any digital delivery of document images can 
compete with paper for all, or even for the most frequent purposes. It is still 
true today, as Sellen and Harper [50] report, that “paper [remains] the medium 
of choice for reading, even when the most high-tech technologies are to hand.” 
They suggest these reasons: (a) paper allows “flexible [navigation] through doc- 
uments;” (b) paper assists “cross-referencing” of several documents at one time; 
(c) paper invites annotation; and (d) paper allows the “interweaving of reading
and writing.” 

New technologies such as E-ink [21] and Gyricon [25] promise electronic doc- 
ument display with more of the advantages of paper (and new advantages of 
electronics). More - perhaps even fundamental - research into user interactions
with displays during reading and browsing appears to be needed to understand 
fully the obstacles to the delivery of document images via DL’s. 

4 Capture 

Since the capture of document images for use in DL’s usually occurs in large-scale 
batch operations during which documents may be damaged or destroyed, and 
which are too costly ever to be repeated, there is a compelling need for methods 
of designing document scanning operations so that the resulting images will serve 
a wide variety of uses for many years, not just those uses narrowly imagined at 
the time. Image quality should be, but often is not, carefully quantified, e.g., 
at a minimum: depth/color, color gamut and calibration, lighting conditions, 
digitizing resolution, compression method, and image file format. In addition 
to these, we need richer use-specific metrics of document image quality, tied 
quantitatively to the reliability of downstream uses (e.g., legibility, both machine 
and human). 

4.1 Scanner Specifications 

Digitizing resolutions (spatial sampling frequency) for textual documents typi- 
cally range today between 300 and 400 pixels/inch (ppi); 600 ppi is less common 
but is gaining as scanner speed and disk storage capacity increase. 
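
The storage side of this trade-off is simple arithmetic. The sketch below computes the uncompressed size of one page image; the 8.5 x 11 inch page size and the function name are assumptions made for illustration, not figures from the text.

```python
def raw_image_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at `ppi` pixels per inch."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

# Assumed page size: 8.5 x 11 inches, 24-bit color.
for ppi in (300, 400, 600):
    size = raw_image_bytes(8.5, 11, ppi, 24)
    print(f"{ppi} ppi: {size / 1e6:.1f} MB uncompressed")
```

Because pixel count grows with the square of the resolution, moving from 300 ppi to 600 ppi quadruples the uncompressed size of every page.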

For what downstream uses are these rough guidelines sufficient? Research op- 
portunities here are many, of this general type: does a particular scanning regime 
for modern books and printed documents (e.g., 300 ppi 24-bit color) reliably provide
images (of text, at least) which will support the best achievable recognition
accuracy in the future, as image processing methods improve? Or should we, as 
a research community, help develop more exacting scanning standards? 

A joint activity between AIIM and the Association for Suppliers of Printing, 
Publishing and Converting Technologies (NPES) is discussing an international 
standard (PDF-Archive) [45] to define the use of PDF for archiving and preserving
documents.

Test targets for evaluating scanners include: 

— IEEE Std 167A-1987, a facsimile machine test target that is produced by
continuous-tone photography, with patterns and marks for a large range of 
measurements of moderate accuracy; 

— AIIM Scanner Target, an ink-on-paper, halftone-printed target; and 

— RIT Process Ink Gamut Chart, a four-color (cyan, magenta, yellow, and 
black), halftone-printed chart for low accuracy color sensitivity determina- 
tions. 

To what extent do existing test targets, e.g., AIIM [2] ANSI/AIIM MS-44-1988
“Recommended Practice for Quality Control of Image Scanners” and MS-44, 
allow for the manual or automatic monitoring of image quality needed for DIA 
processing? Do we need to design new targets for this purpose? 

4.2 Measurement and Monitoring of Quality 

Certainly we must recommend that the technical specifications of scanning con- 
ditions be preserved and attached (as metadata) to the resulting images. For 
many existing databases of document images, this has not been done. To our 
knowledge there does not yet exist a recommendation for such standards. There- 
fore, tools for the automatic estimation of scanner parameters from images of 
text could be an important contribution to the success of DL’s. Exploratory research
in this direction is under way (e.g., [55]), but many questions are as yet
unanswered. For example, how accurate will these estimates be? Can we estimate
most of the image quality parameters that affect recognition? Will the tools run fast
enough to be applied in real time as the images are scanned?

A few DIA studies have attempted to predict OCR performance and to choose 
image restoration methods to improve OCR, guided by automatic analysis of 
images (cf. [59] and its references). The gains, so far, are modest. Can these 
methods be refined to produce large improvements? Can improving image qual- 
ity, by itself, improve OCR results enough to obviate the need for post-OCR 
correction? 

5 Initial Processing 

A wide range of early-stage image processing tools are needed to support high- 
quality image capture. Image calibration and restoration must usually be spe- 
cialized to the scanner, and sometimes to the batch. Image processing should, 
ideally, occur quickly enough for the operator to check each page image visually 
for consistent quality; this modest capability is, as of yet, hard to achieve. Tools 
are needed for orienting pages so text is rightside up, for deskewing, deshear- 
ing, and dewarping, and for removing “pepper” noise and dark artifacts in book 
gutters and near edges of images. Software support for clerical functions such as 
page numbering and ordering, and the collection of metadata, is also crucial
to maintaining high throughput. Few, if any, of these tasks present difficult DIA 
problems, but care is needed in the design of the user interface. 

One place where DIA technology could help is in checking each page image for 
completeness and consistency: (a) Has any text been unintentionally cropped? 
(b) Are basic measures of image consistency (e.g., brightness, contrast, intensity
histograms) stable from page to page, hour after hour? (c) Are image properties 
consistent across the full page area for each image? These seem to be fairly 
challenging problems in general, but specific cases may yield to standard image 
processing techniques. 
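
For question (b), one of the specific cases that may yield to standard techniques is a page-to-page brightness check. The sketch below flags pages whose mean brightness drifts away from the batch median; the 15% tolerance, the function names, and the sample values are assumptions, and a production check would also look at contrast and full histograms.

```python
from statistics import mean, median

def page_brightness(pixels):
    """Mean gray level of one page, given its pixel values as a flat sequence."""
    return mean(pixels)

def flag_inconsistent_pages(brightness_by_page, tolerance=0.15):
    """Indices of pages whose brightness deviates from the batch median by more
    than `tolerance` (a fraction of the median)."""
    reference = median(brightness_by_page)
    return [index for index, value in enumerate(brightness_by_page)
            if abs(value - reference) > tolerance * reference]

# Hypothetical per-page mean gray levels for a short batch; page 3 is too dark.
batch = [200, 198, 203, 150, 201]
print(flag_inconsistent_pages(batch))
```

Using the median as the reference keeps a few bad pages from shifting the baseline against which every page is judged.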

Are the page numbers - located and read by OCR on-the-fly - in an unbroken
ascending sequence, and do they correspond to the automatically generated
metadata? This problem is surely directly solvable using existing techniques,
with perhaps the addition of string-correcting constraint-satisfaction analysis of
the number sequences; however, we are not aware of any published solution.
Perhaps it will someday be possible to assess both human and machine legibility 
on the fly (today this may seem a remote possibility, but cf. [16]). 
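
The simplest form of the page-number check can be sketched as follows. This is only the sequence test, without the string-correcting constraint-satisfaction analysis mentioned above; the function name, the use of `None` for a page where OCR found no number, and the sample input are assumptions.

```python
def check_page_sequence(ocr_numbers):
    """Return (index, expected, found) for each page whose OCR'd number breaks
    an unbroken ascending run. `None` marks a page where no number was read."""
    problems = []
    expected = None
    for index, found in enumerate(ocr_numbers):
        if expected is not None and found != expected:
            problems.append((index, expected, found))
        # Resynchronize on the observed number when there is one;
        # otherwise keep counting from the last expectation.
        if found is not None:
            expected = found + 1
        elif expected is not None:
            expected += 1
    return problems

# Hypothetical scan: page 4 was skipped, and one page number was not read.
print(check_page_sequence([1, 2, 3, 5, 6, None, 8]))
```

Because the check resynchronizes on whatever OCR reports, a single misread digit is reported twice (once on the misread page and once on the next), which is where a string-correcting step would help.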



5.1 Restoration 

Document image restoration can assist fast and painless reading, OCR for textual
content, DIA for improved user experience (e.g., format preservation), and
characterization of the document (age, source, etc.). To these ends, methods 
have been developed for contrast and sharpness enhancement, rectification (in- 
cluding skew and shear correction), super-resolution, and shape reconstruction 
(for a survey, see [37]), but there appear to be quite a few open problems. 

6 Analysis of Content 

The analysis and recognition of the content of document images requires, of 
course, the full range of DIA R&D achievements: page layout analysis, text/non- 
text separation, typeset/handwritten separation, text recognition, labeling of 
text blocks by function, automatic indexing and linking, table and graphics 
recognition, etc. Most of the DIA literature is devoted to these topics. 

However, it should be noted that images found in DL’s, since they represent 
many nations, cultures, and historical periods, tend to pose particularly severe 
challenges to today’s DIA methods, and especially to the architecture of DIA 
systems, which are not robust in the face of multilingual text and non-Western
scripts, obsolete typefaces, old-fashioned page layouts, and low or variable image 
quality. The sheer variety of document images that are rapidly being brought 
online threatens to overwhelm the capabilities of state-of-the-art DIA systems;
this fact, taken alone, suggests that a fruitful direction for DIA R&D is a search 
for tools that can reliably perform specific, perhaps narrowly defined, tasks across 
the full range of naturally occurring documents. These might include: 

1. Does an image contain any printed or handwritten text? 

2. Does it contain a long passage (e.g., 50 or more words) of text?

3. Isolate all textual regions, separating them from non-text and background; 

4. Identify/segment handwritten from machine-printed text; and 

5. Identify script (writing system) and language of regions of text. 

This might be called a breadth-first (or versatility-first) DIA strategy. Most of 
these tasks have, of course, already received some attention in the literature. 
What is new, perhaps, is the emphasis on achieving some level of competency 
(perhaps not always high) across orders of magnitude more document image 
types than has been attempted thus far. 

6.1 Accurate Transcriptions of Text

The central task of DIA research has long been to extract a full and perfect tran- 
scription of the textual content of document images. No existing OCR technol- 
ogy, experimental or commercially available, can guarantee near-perfect accuracy 
across the full range of document images of interest to users. Furthermore, it is 
rarely possible - even for an OCR expert - to predict how badly an OCR system 
will fail on a given document. Even worse, it is usually impossible to estimate 
automatically, after the fact, how badly an OCR system has performed (but, see 
[49]). This combination of unreliability, unpredictability, and untrustworthiness 
forces expensive manual “proofing” (inspection and correction) in all document 
scan-and-conversion projects that require a uniformly high standard of accuracy. 
(Of course, if an average high accuracy across a large set of documents is needed, 
existing commercial OCR systems may be satisfactory.) 

The open problems here are clearly difficult, urgent, and many, but they are 
also already thoroughly discussed in the DIA literature (e.g., [41] and [48]).

6.2 Labeling of Structure 

DL’s would certainly benefit from DIA facilities able to label every part of docu- 
ment structure to the degree of refinement supported by markup languages such 
as XML. Of course, the general case of this remains a resistant class of DIA prob- 
lems. However, even partial solutions might be useful in DL’s since they would 
aid in navigation within and among documents, capturing some of the flexibility 
that keeps paper competitive with DL’s. Navigation can be assisted by a wide 
range of incomplete, and even errorful, functional labelings for the purposes of, 
for example, creating indices and overviews (at various levels of detail), jumping 
from one section to the next, following references to figures, and so on. 

7 Presentation, Printing, and Reflowing 

Paper invites the “spreading out” of many pages over large surfaces. The relative
awkwardness of digital displays is felt particularly acutely here. When attempt- 
ing to read images of scanned pages on electronic displays, it is often difficult to 
avoid panning and zooming, which quickly becomes irritating and insupportable. 

This problem has been carefully and systematically addressed by several 
generations of eBook design, and progress is being made toward high-resolution, 
grayscale and color, bright, high contrast, lightweight, and conveniently-sized 
readers for page images. But even when eBooks approach paper closely enough 
to support our most comfortable habits of reading, there will still be significant
needs for very large displays so that large documents (e.g., maps, music,
engineering drawings) and/or several-at-once smaller documents can be taken 
in at one glance. Perhaps desktop multi-screen “tiled” displays will come first; 
but eventually it may be necessary to display documents on desk-sized or wall- 
sized surfaces. The DIA community should help the design of these displays and 
should investigate versatile document-image tiling algorithms. 

In many printed materials, the author’s and editor’s choice of typeface, type size,
and layout are not merely aesthetic; they are meaningful and critical to
understanding. Even if DIA could provide “perfect” transcription of the textual 
content (as ASCII/Unicode/XML), many critical features of its original appear- 
ance may have been discarded. Preserving all of these stylistic details through 
the DIA pipeline remains a difficult problem. One solution to this problem is, of 
course, multivalent representations where the original image is always available 
as one of several views. 

Recently, DIA researchers have investigated systems for the automatic analysis
of document images into image fragments (e.g., word images) that can be
reconstructed or “reflowed” onto a display device of arbitrary size, depth, and
aspect ratio (e.g., [12]). The intent is to allow imaged documents to be read
on a limited-resolution, perhaps even handheld, computing device, without any 
errors or losses due to OCR and retypesetting, thus mimicking one of the most 
useful features of encoded documents in DL’s. It also holds out the promise 
of customizable print-on-demand services and special editions, e.g., large-type 
editions for the visually impaired. 
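
The layout step of reflowing can be sketched as greedy line filling over word-image widths. This is not the system of [12]: it assumes the word images have already been segmented and placed in reading order, and the function name, the gap of 4 pixels, and the sample widths are hypothetical.

```python
def reflow(word_widths, display_width, gap=4):
    """Greedily pack word images (given by pixel width, in reading order) into
    lines no wider than `display_width`. Returns lists of word indices."""
    lines, current, used = [], [], 0
    for index, width in enumerate(word_widths):
        needed = width if not current else used + gap + width
        if current and needed > display_width:
            lines.append(current)            # line is full: start a new one
            current, used = [index], width
        else:
            current.append(index)            # word fits (or is alone on its line)
            used = needed
    if current:
        lines.append(current)
    return lines

# Hypothetical word-image widths in pixels, reflowed for a 100-pixel-wide display.
print(reflow([30, 50, 20, 60, 40], display_width=100))
```

A word image wider than the display is placed alone on its own line and left to overflow; scaling it down would be a separate design decision.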

This is a promising start, but, to date, document image reflowing systems 
work automatically only on body text and still have some problems with read- 
ing order, hyphenation, etc. Automation of link-creation (to figures, footnotes, 
references, etc.) and of indices (e.g., tables of contents) would greatly assist navigation
on small devices. It would be highly useful to extend reflowing to other
parts of document images such as tables and graphics, difficult as it may be to 
imagine how this could be accomplished under the present state of the art. 

Similar issues arise when users wish to reprint books or articles found in DL’s. 
It should be possible for such a user to request any of a wide range of output 
formats, e.g., portrait or landscape, multiple “pages per page,” pocketbooks, 
large-type books, etc. In most of these cases, some DIA problem needs to be 
solved. 



8 Indexing, Retrieval, and Summarization

Both indexing and retrieval of document images are critical for the success of 
DL’s. To pick only a single example, the JSTOR DL [29] includes over 12 million 
imaged pages from over 300 scholarly journals and allows searching on (OCRed) 
full text as well as on selected metadata (author, title, or abstract field). Most 
published methods for retrieval of document images first attempt recognition 
and transcription followed by indexing and search operating on the resulting (in 
general, erroneous) encoded text (using, e.g., standard “bag-of-words” informa- 
tion retrieval (IR) methods). The excellent survey [20] summarized the state of 
the art (in 1997) of retrieval of entire multi-page articles as follows: 

1. at OCR character error rates below 5%, IR methods suffer little loss of either 
recall or precision; and 

2. at error rates above 20%, both recall and precision degrade significantly.

There is a small but interesting literature on word-spotting “in the image domain.”
These approaches seem to offer the greatest promise of large improvements
in recall and precision (if not in speed). An open problem, not much
studied, is the effectiveness of OCR-IR methods on very short passages, such
as, in an extreme but practically important case, short fields containing key
metadata (title, author, etc.). Many textual analysis tasks (e.g., those that depend
on syntactic analysis), whether modeled statistically or symbolically, can
be derailed by even low OCR error rates.



8.1 Summarization and Condensation

There has been, to our knowledge, only a single DIA attack on the problem 
of summarization of documents by operating on images, not on OCRed text. 
In this work [15], word-images were isolated and compared by shape (without 
recognition) and thereby clustered. The cluster occurrences and word sizes were 
used to distinguish between stop words and non-stop words, which were then 
used to rank (images of) sentences in the usual way. 
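The pipeline described in [15] can be illustrated with a minimal sketch. It assumes that word images have already been clustered by shape, so that each word is represented only by a cluster identifier and its width in pixels. The thresholds `min_count` and `max_width`, the scoring rule (mean cluster frequency of the non-stop words) and the function name are illustrative assumptions, not details taken from [15].

```python
from collections import Counter

def rank_sentences(sentences, min_count=3, max_width=30):
    """Rank sentences given as lists of (cluster_id, width_px) word images.

    Returns sentence indices, best first. No character recognition is used:
    words are known only by the shape cluster they fall into.
    """
    counts = Counter(cid for sent in sentences for cid, _ in sent)
    total_width = Counter()
    for sent in sentences:
        for cid, width in sent:
            total_width[cid] += width

    # Assumed stop-word heuristic: clusters that are both frequent and short.
    stop = {cid for cid, n in counts.items()
            if n >= min_count and total_width[cid] / n <= max_width}

    def score(sent):
        content = [cid for cid, _ in sent if cid not in stop]
        if not content:
            return 0.0
        # Assumed scoring rule: mean frequency of the non-stop clusters.
        return sum(counts[cid] for cid in content) / len(content)

    return sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)
```

With three toy sentences in which cluster 1 is a short, frequent shape, the call `rank_sentences([[(1, 20), (2, 80), (3, 70)], [(1, 20), (4, 60)], [(1, 21), (2, 82), (3, 71), (2, 79)]])` treats cluster 1 as a stop word and ranks the third sentence first.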

This successful extension of standard information retrieval methods into the 
purely image domain should spur investigation of similar extensions, for example, 
methods for condensing document images by abstracting them into a set of 
section headers. 



8.2 Non-textual Content 

Non-textual content such as mathematical expressions, chemical diagrams, tech- 
nical drawings, maps, and other graphics have received sustained attention by 
DIA researchers, but it may be fair to say that search and retrieval for these 
contents is at a much less mature stage than for text. 

9 Personal and Interactive Digital Libraries 

Research has recently gotten underway into “personal digital libraries,” with 
the aim of offering tools to individuals willing to try to scan their own docu- 
ments and, mingling imaged and encoded files, assemble and manage their own 
DL’s. All the issues we have mentioned earlier are applicable here, but perhaps 
there is special urgency in ensuring that all the images are legible, searchable, 
and browseable. Thus there is a need for deskilled, integrated tools for scan- 
ning, quality control and restoration, ensuring completeness, adding metadata, 
indexing, redisplay, and annotation. An early example of this, using surprisingly 
simple component DIA technologies informally integrated, is described in [56]. 
In addition, this might spur more development and wider use of simple-to-use, 
small-footprint personal scanners and handheld digital cameras to capture doc- 
ument images, with a concomitant need for DIA tools (perhaps built into the 
scanners and cameras) for image dewarping, restoration, binarization, etc.

In addition, one may wish to detect duplicates (or near duplicates), either to
prune them or to collect slightly differing versions of a document; the DIA literature
offers several effective attacks on this problem (cf. [20]), operating both in
the textual and the image domain. Even when document content starts out in 
encoded form (is “born digital”), document image analysis can still be impor- 
tant. For instance, how might duplicate detection be performed when one of the 
versions is in PDF format and the other is in DjVu? The common denominator 
must be the visual representation of the document since, from the point of view 
of individual (especially non-professional) users, the visual representation will 
be normative. 

Often, users may wish to be able to perform annotation using pen-based input 
(on paper or with a digital tablet/stylus). A role for document image analysis 
here could be annotation segmentation/lifting or word-spotting in annotations. 

9.1 Interactive and Shared Digital Libraries 

As publicly available DL’s gather large collections of document images, oppor- 
tunities will arise for collective improvement of DL services. For example, one 
user may volunteer to correct an erroneous OCR transcription; another may be 
willing to indicate correct reading order or add XML tags to indicate sections. 
In this way a multitude of users may cooperate to improve the usefulness of the 
DL collection without reliance on perfect DIA technology. Within such a com- 
munity of volunteers, assuming it could establish a culture of trust, review, and 
acceptance, DIA tools could be critically enabling. 

An example of such a cooperative volunteer effort, which is closely allied 
intellectually to the DIA field, is The Open Mind Initiative [58], a framework for 
supporting the development of “intelligent” software using the Internet. Based on 
the traditional open source method, it supports domain experts, tool developers, 
and non-specialist “netizens” who contribute raw data. 

Another example, from the mainstream of the DL field, is Project Gutenberg 
[46], the Internet’s oldest producer of free electronic books (eBooks or eTexts). 
As of November 2002, a total of 6,267 “electronic texts” of books had been made
available online. All the books are in the public domain. Most of them were typed 
in and then corrected (sometimes imperfectly) by volunteers working over the 
Web. Such databases are potentially useful to the DIA community as sources 
of high quality ground-truth associated with known editions of books, some of 
which are available also as images. These collections have great potential to drive 
DIA R&D relevant to DL’s, as well as to benefit from it. 

9.2 Providing DIA Tools for Building DL’s 

To assist such interactive projects, the DIA field should consider developing 
DIA tool sets freely downloadable from the Web, or perhaps run on DL servers 
on demand from users. These could allow, for example, an arbitrary TIFF file 
(whether in a DL or privately scanned) to be processed, via a simple HTML link, 
into an improved TIFF (e.g., deskewed). Each such user would be responsible
for ensuring that his/her attempted operation succeeded - or, less naively, there
could be an independent review. The result would then be uploaded into the 
DL, annotated to indicate the operation and the user’s assurance (and/or the 
associated review). In this way, even very large collections of document images 
could be improved beyond the level possible today through exclusively automatic 
DIA processing. 

10 Conclusions 

In this paper, we have attempted to provide an overview of some of the chal- 
lenges confronting the builders of document analysis systems in the context of 
digital libraries. While it may seem disheartening to realize that so many im- 
portant problems remain unsolved, there is no doubt that both the DIA and 
DL communities have much to offer one another. As a practical testbed for 
document analysis techniques and a real-world application of enormous cultural 
importance, we anticipate that digital libraries will provide a valuable focus for 
work in our field.

Abstract. The development of an online version of the Trinity College Dublin
Printed Catalogue, which lists books from the 15th C to 1872, is described. The
principal benefit of the system is the ability to search on words and word stems
in the title field. As the entries are in at least fourteen languages, the language of
each Roman script entry was determined, with a success rate of over 90%. The
image of the entry from the catalogue is displayed, which hides the OCR errors.



1 Introduction 

Most of the books and periodicals in the Library of Trinity College Dublin received
up to 1872 are listed in the so-called Printed Catalogue [1]. In the author’s experience
it is not always easy to find the book sought. Recently I was told by a historian interested
in railways that the Library did not have Herapath’s Railway Magazine. It does
have this work, which is indexed under MAGAZINE, a heading which extends over six pages.

It was from the difficulty of looking up the catalogue, and the fortuitous finding of some
rare books when preparing an exhibition on “Computing before the Electronic Computer”,
that the idea for this project was conceived. It was carried out by undergraduate
[2, 3] and postgraduate students [4, 5] commencing in 1990.

In the following the history of the Library [6] is described briefly. The develop- 
ment and general structure of the catalogue is then described. The major part of the 
project consisted of five principal steps: scanning and OCR, OCR output correction, 
natural language recognition, indexing, and the development of the user interface. A 
demonstration will be given at the Workshop. 



2 The Library of Trinity College Dublin 

Trinity College Dublin was founded in 1592. The Library surely was started later in 
that decade. Luke Challoner, who made a major contribution to the development of 
the College, and James Ussher, one of the first Scholars and Fellows of the College, 
were sent on book buying expeditions to London, Oxford and Cambridge and by 1610 
there were about 4000 books in the Library. Over the years there were a number of 
major donations. The first major donation was that of Ussher’s Library in 1661 by the 
British House of Commons. He had resigned from the College to accept an appointment
as a bishop and he eventually became Archbishop of Armagh. He was a renowned
scholar and his library consisted of about 10,000 books and manuscripts. It is
stored today in the Long Room on the shelves on which it was originally placed in
1732 when the Library building was eventually completed, construction having 
started in 1712. In 1743 the library of 13,000 books collected by Claudius Gilbert, a 
former Vice-Provost and Librarian, was given to the College. The next major acquisi- 
tion was purchased by the College from Hendrik Fagel, Chief Minister of Holland, in 
1802. This consisted of 20,000 books and added considerably to the number of Dutch 
and other continental books. In 1801, following the Act of Union between Great Britain
and Ireland, the Library became a legal deposit library and was able to claim a
free copy of every book and periodical published in Great Britain and Ireland, a privi- 
lege it retains to this day. Consequently the number of acquisitions increased dramati- 
cally and by the 1830s the existing manuscript catalogues were in poor condition. In 
1831 James Henthorn Todd, a Hebrew and Irish scholar, was appointed assistant li- 
brarian at no salary and he set about improving the cataloguing with a view to prepar- 
ing a printed catalogue. 



3 The Printed Catalogue 

Finding the existing catalogues to be almost useless Todd persuaded the Board of the 
College to appoint permanent Library Clerks for the first time and set about catalogu- 
ing the books on slips of paper in 1835. This work was completed in 1846 and Todd 
was now in a position to commence printing. He used Bandinel’s Catalogue [7] of the 
Bodleian Library in Oxford as a model but made significant improvements. Unlike 
Bandinel he decided not to set a date at which he would stop adding books as he knew 
that printing would take a long time and many books would be unreasonably ex- 
cluded. As the catalogue was to contain both primary and secondary entries and was 
to be printed over many years it is possible that secondary entries will be found without
primary entries and vice-versa. A primary entry is listed under author (or
other keyword such as Ireland) and contains the title, place of publication, date, size,
edition, no. of volumes etc. and finally as many shelf marks as there are copies of the
book. A secondary entry is listed under a different heading and the entry is an 
abridgement of the title with the primary entry heading printed in capitals. An exam- 
ple of both is shown in Figure 1. A knowledge of Latin was presumed! 

CHRISTIUS (Johannes Fridericus). — De Nicolao
Machiavello libri tres, in quibus de vita et scriptis,
item de secta eius viri atque in universum de politica
nostrorum post instauratas litteras temporum
. . . ratio habetur.
Lipsiae et Hal. 1731. 4°. OO. f. 1.
Fag. G. 1. 43.

— Joh. Frider. CHRISTII de N. Machiavello libri tres.
OO. f. 1.

Fig. 1. A primary and secondary entry. 

The imprint in the primary entry starts at the beginning of a new line unlike the 
older catalogues. Fig. 2 shows how effective this is compared with the Bodleian Cata- 
logue of 1843. Note that an elongated hyphen — is placed at the start of each separate 
entry in the TCD Catalogue unlike the Bodleian Catalogue where it is only used for
different editions of the same work. Todd claimed, with justification, that his arrangement
was advantageous and made it easier for the eye to run down the page
when trying to find a book. The Printed Catalogue [1] is a finding-catalogue and Todd
did not aim to produce a full descriptive catalogue. A shelf number is included in both
the primary and in the principal type of secondary entry. This latter is very convenient,
especially when a secondary appeared before a primary. Note that Bandinel’s
Bodleian Catalogue did not include a shelf mark. These were written in by hand.
Todd realised that the position of some books may change and indeed they have but
the vast majority are in the same place. Examples of the five types of shelf mark are
shown in Figure 3.

[Facsimile images of the DELAMBRE entries as printed in the two catalogues.]

Fig. 2. (a) 1872 Printed Catalogue on the left (b) Bodleian Catalogue on the right.



Long Room: L. f. 8,9
Long Room Gallery: Gall. MM. 6. 32
East and West End Gallery: Gall. 3. f. 34.
Fagel: Fag. L. 2. 15.
Quin N° 123

Fig. 3 . Examples of shelf marks. 

The A-B volume, together with a Supplement, was in the printer’s hands from 
1849 to 1862 and was published in 1864. Todd, who had been appointed Librarian in 
1852, died in 1869 before the other volumes were printed. Henry Dix Hutton was 
given the task of editing the remaining seven volumes and a supplement which also 
contained an addenda and corrigenda. This has been made available on the Workshop 
website. The T-Z volume was printed in 1885 and the Supplement in 1887. The whole 
project took 52 years. The 5121 pages of one set of the eight volumes were separated 
in 1987 in order to make a microfiche copy and these pages were used in the project 
described in this paper. There are about 250,000 entries in the catalogue. 

The catalogue contains entries in at least fourteen languages. English and Latin occur
most frequently and other languages in the Roman alphabet include French, Italian,
Spanish, Portuguese, German, Dutch, Danish, Norwegian, Swedish and Irish.
There are also entries in non-Roman alphabets: Greek, Hebrew, Cyrillic, Arabic and
Syriac. If an author’s works are mainly in Latin the Latin version of the name was 
used. For example ‘KEPLER’ is listed under ‘KEPPLERUS’. It is similar for other 
languages. ‘PETRARCH’ is listed under ‘PETRARCA’. 



4 Optical Character Recognition 

Several OCR packages were tested at the time (1990) but none were very satisfactory. 
As the catalogue pages had a clear structure as shown in Fig. 2(a) it was decided to 
make use of this and write our own software which is described by Anderson [4] in 
his M.Sc. thesis. Each entry has a clear starting point, either a string of capitals fol- 
lowed by an elongated hyphen or just an elongated hyphen. Consequently each entry 
can be easily extracted. Each page consists of about 4800 characters, excluding 
spaces, and each line has 48 character positions. There are 84 lines in each column, 
including the blank lines before each new author. 

The objective of the OCR and associated processing is to extract each entry from 
the image and to transform it into the form shown in Fig. 4(b). Template matching 
was used. The templates were selected by a human operator and there are more than 
300 of them. However using a state machine shown in Fig. 5 during the recognition 
process it was possible to split the templates into 8 sets corresponding to the A, F, E, 
T, C, D, L fields in Fig. 4(b) and the Series field. This speeded up the process consid- 
erably and probably improved the recognition rate. The number of templates matched 
can be reduced, by about 70%, by filtering based on the width and height of a charac- 
ter. Of course the segmentation problem was difficult and some progress was made. 
The OCR program was written in C and initially run on the now extinct Inmos T800 
transputer. The success rate was 96% on a small sample of 9 pages. It will be appreci- 
ated that counting errors is very tedious. As there are about 4800 characters on each 
page this indicates that there are about 240 errors per page. 
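The filtering of templates by character size mentioned in this section can be sketched as follows. The original program was written in C and its data structures are not given here, so the `Template` class, the pixel tolerance `tol` and the function name are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class Template:
    label: str    # character the template represents
    width: int    # template width in pixels
    height: int   # template height in pixels

def candidate_templates(templates, width, height, tol=2):
    """Return only the templates whose size is within `tol` pixels of the
    segmented character box, so fewer full template matches are attempted."""
    return [t for t in templates
            if abs(t.width - width) <= tol and abs(t.height - height) <= tol]
```

Only the templates that survive this inexpensive test need to be compared pixel by pixel with the character image, which is how a size filter can reduce the matching work.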

(a) ZEISING (A.) — Neue Lehre von den Proportionen des
    menschlichen Korpers.
    Leipzig, 1854. 8°. T. 00. II.

(b) P: 1130R
    H: ZEISING(A.)
    A: ZEISING
    F: A.
    E:
    T: — Neue Lehre von den Proportionen des menslichen Korpers
    C: Leipszig
    PR:
    D: 1854
    X:
    L: GERMAN
    S: T. 00 . 11.



Fig. 4. (a) 1872 Printed Catalogue entry (b) Entry after processing.

Trying to manually edit 5000 pages with 240 errors each is out of the question and consequently
an OCR output corrector was written and is described in the next section.
The scanning of the 5121 pages was done in the summer of 1991 using a Microtek
3600 legal size scanner. Improved OCR software was transferred back to a PC. As the
OCR was slow this was carried out simultaneously on multiple PCs.
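The scale of the correction task follows from simple arithmetic on the figures quoted in this section (about 4800 characters per page, 5121 pages). The short sketch below is only a check of those figures; the function names are illustrative. It shows that 240 errors per page corresponds to a recognition rate of 95%, whereas a rate of 96% would correspond to 192 errors per page.

```python
CHARS_PER_PAGE = 4800   # approximate figure quoted in Section 4
PAGES = 5121            # number of pages scanned

def errors_per_page(recognition_rate_pct, chars=CHARS_PER_PAGE):
    """Expected number of character errors on a page at a given recognition rate."""
    return round(chars * (100 - recognition_rate_pct) / 100)

def recognition_rate_pct(errors, chars=CHARS_PER_PAGE):
    """Recognition rate (per cent) implied by a number of errors on a page."""
    return 100 * (chars - errors) / chars

def total_errors(errors_on_page, pages=PAGES):
    """Errors in the whole catalogue if every page has the same error count."""
    return errors_on_page * pages
```

At 240 errors per page the whole catalogue would contain roughly 1.2 million errors, which is why manual editing was ruled out.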




[State diagram; conditions labelling the transitions:]

*1: newline, not m space
*2: char italic and start newline
*3: newline large space before char



Fig. 5. State machine. 



Table 1. Percentage OCR errors from a 50 page set.

Error type             Percentage
Substitution           57.1
Insertion              4.64
Deletion               1.25
Transposition          0.06
Run-on words           6.8
Split words            1.8
Noise                  1.0
‘h’ to ‘ri’            4.28
‘1 o’ errors           6.32
Dot within word        2.26
Number substitution    2.74
Wrong Guesses          3.64
Indecipherable         7.64
Miscellaneous          0.47



5 OCROC - An OCR Output Corrector 

Because of the numerous errors on each page it was decided to write an OCR Output 
Corrector (OCROC) which would be as automatic as possible and would facilitate 
interactive correction. The OCROC program [2] was written in the Spitbol [8] imple- 
mentation of Snobol 4. A further error analysis was carried out on a set of 50 pages. 
This yielded 1,364 errors or 273 errors per page which is somewhat more than on the
earlier sample of nine pages. The recognition rate is about 94%. Table 1 lists the percentage
occurrence of the OCR errors. About 16% of the errors were language independent
and could be corrected automatically. Examples are: ‘irl’ -> ‘ri’ which occurred
in ‘prrlests’ corrected to ‘priests’ and ‘chirlstophorum’ corrected to
‘christophorum’. To attempt to correct the remaining errors it was desirable to know
the language of the title. It was not feasible to use a conventional dictionary as many 
of the words would not be found in it. It was found that the number of word types 
increased approximately linearly with the number of word tokens. Zipf’s Law [9] 
does not hold for words in the catalogue. 
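Language-independent corrections of the kind described in this section can be pictured as a small table of rewrite rules applied to every word. OCROC itself was written in Spitbol and its rule set is not listed here, so the sketch below is an assumption throughout: only the ‘irl’ -> ‘ri’ rule comes from the text, while the patterns for a dot within a word and for digit substitution are hypothetical forms of the error classes named in Table 1.

```python
import re

# Hypothetical rule table; only the first rule is taken from the text.
RULES = [
    (re.compile(r"irl"), "ri"),                   # 'irl' -> 'ri'
    (re.compile(r"(?<=[a-z])\.(?=[a-z])"), ""),   # dot within a word (assumed form)
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),   # digit 1 inside a word (assumed form)
    (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),   # digit 0 inside a word (assumed form)
]

def correct(word):
    """Apply every rewrite rule in turn to a single OCR output word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word
```

A rule applied unconditionally would also damage correct words that happen to contain the pattern, so a practical corrector would need to restrict where each rule fires. The sketch leaves that out.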

5.1 Natural Language Recognition 

Suitable dictionaries for the languages used in the catalogue were not available a 
priori. The entries contain proper nouns, archaisms and variations in dialect and some 
of the titles are quite short. Initially it was decided to use unique function words for 
each of the languages. English, Latin, French, Spanish, Italian, German, and Dutch 
were initially chosen. It quickly became clear that function words alone would not be 
effective. For example, consider: 

Pratique de la geometrie

‘de’ and ‘la’ both occur in more than one language so this would be unrecognised. 
Inflectional morphemes such as ‘-ique’ were then included in the language tables. 
This method of recognition only yielded a success rate of 70% when tested on 100 
pages. The recognition rate improved as more words and morphemes were added but 
saturation was reached at 70 entries. Improved results were obtained by allowing 
duplicate function words and morphemes and by assigning a score of 1 for each lan- 
guage found and a success rate of 88% was achieved on the same 100 pages. Analys- 
ing the effect of function words alone and suffixes alone it was found that the latter 
were less effective so a weighting scheme which favoured function words was intro- 
duced. It was found that assigning a weight of 2 to a unique function word in the title 
and 1 to suffixes and words which occurred in more than one language was best. A success
rate of 90.35% was achieved on the 100 pages [5]. 6.06% of the entries were
marked as unrecognised. Table 2 shows the result for the title: 

trad. en Franc. par Iaques de Miggrede



Table 2. Recognition of Language of Title.

Word or suffix   Language(s)                                   Weight
En               Dutch, French                                 1
Par              French                                        2
De               Spanish, Dutch, Italian, French, Portuguese   1
-ede             Dutch                                         1



The score for French is 4 and for Dutch 3. The unweighted algorithm would have 
been unable to differentiate between French and Dutch. Now having the language of 
90.35% of the entries correctly recognised it was possible to build up dictionaries 
from the titles themselves and a success rate of 99.3% was obtained.

Knowing the language of the entries enables a search to be made based on the language.
It also enabled errors in those languages to be discovered. An approach is
described in Nic Gerailt and Byrne [10]. This used trigrams generated for English, 
French and Latin from corpora obtained from the Internet, a dictionary built up from 
the catalogue from correct words and heuristic rules. Ignoring split words which 
would have to be processed in other ways over 80% of incorrect words were detected. 
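The weighted scoring of Section 5.1 can be reproduced with a minimal sketch. The language tables below contain only the four entries of Table 2, and the function names, the tokenisation and the handling of ties are assumptions made for illustration; the tables actually used in the project were much larger.

```python
from collections import defaultdict

# Language tables restricted to the entries shown in Table 2.
WORDS = {
    "en": {"Dutch", "French"},
    "par": {"French"},
    "de": {"Spanish", "Dutch", "Italian", "French", "Portuguese"},
}
SUFFIXES = {"ede": {"Dutch"}}

def score_title(title):
    """Return a dict mapping each language to its score for the title."""
    scores = defaultdict(int)
    for token in title.lower().split():
        word = token.strip(".,;:")
        if word in WORDS:
            langs = WORDS[word]
            # Weight 2 for a function word unique to one language, else 1.
            weight = 2 if len(langs) == 1 else 1
            for lang in langs:
                scores[lang] += weight
        else:
            for suffix, langs in SUFFIXES.items():
                if word.endswith(suffix):
                    for lang in langs:
                        scores[lang] += 1   # suffixes always weigh 1
    return dict(scores)

def recognise(title):
    """Return the top-scoring language, or None if there is no clear winner."""
    scores = score_title(title)
    if not scores:
        return None
    best = max(scores.values())
    winners = [lang for lang, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else None
```

For the title `"trad. en Franc. par Iaques de Miggrede"` the sketch gives French a score of 4 and Dutch a score of 3, in agreement with the figures quoted for Table 2.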



6 Indexing 

The entries are indexed on eight fields: Author, forenames, author description, title, 
place of publication, date, series and shelf number. Two and three letter words and the 
first four characters of longer words are indexed. These are used to compute the hash 
key. The complete word is stored in the file. This allows for stem searches based
on four or more characters as well as searches on complete words. An asterisk (*) is used as a wildcard
and sometimes this can overcome language problems as well as OCR errors. A search
using ‘math*’ will retrieve entries in English, French, Italian, German and Latin. But
looking for all books on houses presents a difficulty. ‘House’ is ‘maison’ in French,
‘Haus’ in German, ‘casa’ in Italian and Spanish.
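The indexing rule and the wildcard search can be illustrated with a small in-memory sketch. The real system keeps a hashed index file on disk for each field; the dictionary used here, the tokenisation and the function names are assumptions for illustration only.

```python
from collections import defaultdict

def index_key(word):
    """Two and three letter words are indexed whole, longer words by their
    first four characters; single letters are not indexed."""
    w = word.upper()
    if len(w) < 2:
        return None
    return w if len(w) <= 3 else w[:4]

def build_index(titles):
    """Map each index key to a list of (complete word, entry number)."""
    index = defaultdict(list)
    for entry_no, title in enumerate(titles):
        for raw in title.split():
            word = raw.strip(".,;:()").upper()
            key = index_key(word)
            if key is not None:
                index[key].append((word, entry_no))
    return index

def search(index, query):
    """Search for a complete word, or for a stem of four or more
    characters followed by the wildcard '*'."""
    q = query.upper()
    if q.endswith("*"):
        stem = q[:-1]
        bucket = index.get(index_key(stem), [])
        return sorted({n for w, n in bucket if w.startswith(stem)})
    bucket = index.get(index_key(q), [])
    return sorted({n for w, n in bucket if w == q})
```

Because the complete word is stored next to the key, a stem longer than four characters such as ‘mathem*’ can still be resolved by comparing it with the stored words in the bucket for ‘MATH’.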

There is a separate index file for each of the fields which are quite different in na- 
ture. Hash coding is used to create the index on disk and the formula is based on that 
used by Smith and Devine [11] in the Queen’s University Belfast Microbird System. 
It is a radix transformation function based on the radix 7. The formula is: 

\[ \mathrm{Key} = \mathrm{Integer}\Bigl(\sum_{i=0}^{3} \bigl(\mathrm{letter}(i) - 20\bigr)\cdot 7^{\,3-i} \Big/ \mathrm{blocksize}\Bigr) + 1 \]

Radix transformation aims to create a random distribution of keys from a clustered 
and non-uniform set of keys. This cannot be completely achieved and there is provision
for overflow. The separate index file for each field allowed for more efficient use
of disk space and it enables the convenient searching described in Section 7.
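A minimal sketch of the radix transformation is given below. Several details are assumptions, since the text does not state them: letter(i) is taken to be the character code of the i-th letter of the upper-cased index key, keys shorter than four characters are padded with spaces, and the division by the block size is applied to the whole sum.

```python
def radix7_key(word, blocksize=1):
    """Radix-7 transformation of the first four characters of a word.

    Assumptions: letter(i) is the character code of the i-th character of
    the upper-cased word, short words are padded with spaces, and the sum
    is divided by `blocksize` before 1 is added.
    """
    chars = word.upper()[:4].ljust(4)
    total = sum((ord(c) - 20) * 7 ** (3 - i) for i, c in enumerate(chars))
    return total // blocksize + 1
```

Under these assumptions `radix7_key("MATH")` and `radix7_key("MATHEMATICS")` give the same key, as expected from indexing on the first four characters, and a larger block size maps neighbouring sums to the same block.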



7 The User Interface 

The User Interface was built using the Delphi GUI builder and is shown in Fig. 6,
which shows the result of a search for the works of Petrarch which will be discussed
below. The principal objective of the interface was that it must be easy to learn, easy
to use and easy to remember how to use, otherwise it would not be used by readers
who are not computer literate. The entries in the Catalogue are very diverse and readers
consequently would come from a wide variety of backgrounds. Pull-down menus are 
used which cover all options. The options which are likely to be used most frequently 
are indicated by the icons on the bottom of the screen. The function of some of these 
icons is described in the following sections. 

7.1 Search on Author 

The form is shown in Fig. 7. In searching for an author it is possible to use just a surname
but in the case of a common name such as ‘SMITH’ it is not of much use. A
first name may also be used. Many author headings include a description of the author
such as Duke, Bishop, S.J. etc. As usual the descriptions in the Catalogue are in the
language of the entry. Many of the initials are expanded by OCROC to make them
easier to search and all are in English. For example ‘S.J.’ is expanded to Jesuit but it
is also possible to search on the abbreviations. A search on ‘S.J.’ yields 1947 entries.
If a full author name yields no results it is possible to use a wild card. For example a
user might be unsure whether ‘PETRARCH’ is in English, Latin or Italian. It is actually
in Italian (Petrarca) but it is wise to use ‘PETRARC*’. This yields the screen
shown in Fig. 6. The search produced 39 entries. The selected title is listed in
the box in the middle of the screen. The top entry has been selected and the actual
entry from the Catalogue is displayed in the box below. This is intended to give confidence
to the user and it has the major benefit of hiding OCR errors, if any! The actual
post-processed OCR entry is shown in Fig. 8. This is not correctly laid out but it
does contain all the information.

24 John G. Byrne

Fig. 6. The User Interface.



[Form “Search by Author Name” with input fields for Author Surname, Author First Name and Author Description.]

Fig. 7. Search by author name.

H: PETRARCA (Franciscus)
A: PETRARCA
F: Franciscus
E: .
T: _Opera.
Basileae, 1554. fol. 2 tom.
Basileae, 1581. fol.
C: Venetiis
PR:
D: 1416
X:
L: LATIN
S: Fag. D. 1. 14.BB. aa. 30.EE. a. 43, 44.TT. b. 23.

Fig. 8. The postprocessed entry for the first Petrarca entry. 

As many of the author names are in Latin it is possible to latinize a name after an 
unsuccessful search. A search on ‘CARDANO’ is unsuccessful but the latinized ver- 
sion ‘CARDANUS’ yields 23 entries. There are sometimes secondary entries for 
author names. A search on ‘JOHN NAPIER’ gives the response: 

Vid. Joannes NEPERUS 

The ‘Indirect Search’ will place NEPERUS in the author name field. The resulting 
search yields 9 entries including his books on logarithms. 

7.2 Search on Title 

By far the most useful field to search on is the title field. The form is shown in Figure
9. As is clear from the examples shown words are implicitly combined with a logical 
‘and’. The use of the wildcard is very important as it allows four letter stems to be 
used. This is particularly important for entries under A and B as the font used in this 
volume is different from that used in the other volumes and the OCR is poor. The A-B 
volume was printed 19 years before the first of the others. 

Apart from overcoming the deficiencies of the OCR, searching on words from the
title is the principal advantage of the 1872 Online Catalogue. The Herapath magazine 
mentioned in the opening paragraph was found in this way and it is clearly the only 
realistic way to look for items listed under terms such as ‘IRELAND’ (169 occur- 
rences), ‘PARLIAMENT’ (843 occurrences), ‘MAGAZINE’ (246 occurrences), 
‘JOURNAL’ (1066 occurrences) and ‘BIBLIA’ (803 occurrences). 

7.3 Combination Search 

A Boolean search query may be formulated using the Combination form shown in
Fig. 10.

Recently a small collection of Syriac manuscripts has been fully catalogued. A search
for books in Latin with ‘TITLE=Syriac’ yields 116 responses. To reduce this to
books about Codices the search shown in Fig. 10 was made. It yielded six books.

The date range can be specified. A search on ‘TITLE=AGRICU*’ from 1700 to 
1799 yielded 7 books one of which was wrong due to an error in the date recognition. 
A search on ‘DATE =1470’ returns 8 books. There is one older book in the Library 
(1469) but there is no date in the catalogue entry for it.

Fig. 9. Title searching form.

Fig. 10. Combination search showing a search for books about Syriac codices.



7.4 Other Features of the User Interface 

If the incorrect image appears, it is possible to bring up the OCR entry as
shown, for example, in Fig. 4(b) by clicking on ‘View Entry’. Sometimes not all
of the relevant image appears. Clicking on ‘View Image’ displays a larger
surrounding image. The arrows to the right of ‘View Image’ allow one to display
preceding and following images.

8 Conclusion 

The system described in this paper provides access to the Catalogue of books in
the Old Library in a way not hitherto available; for example, one can search on
words in the title. The OCR is not perfect, but this has been overcome to a
significant extent by automatic correction of errors (OCROC), by being able to
search on four-letter stems and by providing a variety of methods of searching
and viewing images. The display of the entry image helps to give confidence to
the user and gives the right shelf number, especially when this has been
overwritten in manuscript. Even the A-B entries, for which the OCR is
particularly bad, can often be discovered by approaching the search in several
ways. The system is used on a daily basis in the Department of Early Printed
Books. It requires about 2 GB of disk space and retrieval performance is very
satisfactory even on an Intel 386 PC. A zip file of images of the pages of the
Supplement can be found at http://www.cs.tcd.ie/John.Byrne/

Abstract. In this study, we outline computational issues in the
design of a Digital Library (DL) for Indic languages. The complicated
character structure of Indic scripts entails novel OCR analysis techniques
and user interface (UI) designs. This paper describes a multi-tier
software architecture, which provides text and image processing tools as
independent, reusable entities. Techniques for measuring and evaluating
different stages of an Indic script recognition engine are outlined.



1 Introduction 

DLs have made possible the global access to content, once limited by physical
location, transportation and cooperation of artifact owners. Researchers have
been showing interest in developing recognition and DL tools for Indic scripts
[3,5,6,9,10,17,20]. Internationally acclaimed literary masterpieces and
centuries of cultural heritage lie hidden in these languages and scripts.
However, this interest in OCR/DL systems for Indic scripts is not supported by
a matching infrastructure of standardized data sets or OCR analysis tools. Data
access for non-Latin scripts like Indic is a frustrating experience for many
users. In this paper, we examine design issues and suggest techniques for
developing Indic DL architectures. User-friendly data entry, multi-lingual data
display, and storage are also outlined for Aryan and Dravidian scripts, with
possible extensions for Arabic. Section 2 describes the characteristics of
Indic scripts and the technological challenges posed by them. Section 3
demonstrates the architecture developed to support Indic DLs, with current and
future applications suggested in Section 4.

2 Indic Scripts

The Indian subcontinent has 18 scheduled (official) languages, which correspond
to 11 scripts¹. These scripts are divided into three categories: (i) Aryan:
Devanagari, Bengali, Gujarati, Oriya, and Punjabi; (ii) Dravidian: Kannada,
Malayalam, Tamil, and Telugu; and (iii) Arabic (Figure 1). All Indic scripts are
inflective, i.e. elements from their alphabet set combine to form new characters
by shape variation based on context.

¹ Source: Central Institute of Indian Languages

Fig. 1. Indic consonants: Aryan scripts - Devanagari, Bengali, Gujarati, Punjabi, and
Oriya; Dravidian scripts - Tamil, Telugu, Kannada, and Malayalam; and Arabic script



Devanagari is a syllabic-alphabetic script derived from Brahmi, and provides
written form to approximately forty-eight languages including Hindi, Konkani,
and Marathi [11, 13]. Each Devanagari character represents a syllable: a unit of
pronunciation that has a stop (coda) at the end. Forming syllabic characters from
alphabets is a representative feature of syllabic-alphabetic scripts, which include
Devanagari, Telugu, Kannada, Bengali etc. [13]. Devanagari is an inflectional
script: its base alphabet forms (Table 1) combine to give new characters. For
example, the compound character kya (क्य) is made by combining forms of ka
(क) and ya (य) (more examples in Table 2). Devanagari characters are joined
together into words by a horizontal line, called the Shirorekha.

Telugu is a Syllabic-alphabetic script of Brahmic origin, and provides written 
form to three languages. In tune with other Syllabic-alphabetic scripts, Telugu 
characters are composed of syllable units ending with a stop, and contain 
conjunct alphabet forms. However, Telugu is a Dravidian script, and has a few 
distinctive characteristics: there is no Shirorekha connecting the characters of a 
word and the consonant half-forms of compound characters do not touch each 
other (Table 1). 

Table 1. Devanagari and Telugu character set. (a): Consonants (left) and their
half-forms (right), (b): Vowels (left) and vowel modifiers (right), bottom:
Numerals. Note: Special character forms are not shown

The Devanagari and Telugu alphabet set (Table 1) shows that many
consonants and vowels have half-forms and vowel modifiers. A Devanagari/
Telugu word (any Indic script word) can be split into characters (syllables),
which are themselves a combination of consonants, vowels, half-forms, modifiers
and special symbols. Using the alphabet set and rules for syllabic-alphabetic
character formation, words are parsed into components using production rules
called script writing grammar (examples in Table 2).
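
As a rough illustration of parsing a word into characters (syllables), the sketch below groups Unicode code points into clusters. It is a deliberately simplified stand-in for the script writing grammar, not the authors' production rules: the only rules assumed here are that combining marks attach to the preceding cluster and that a virama joins the following consonant to it. The function name `split_syllables` is hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def split_syllables(word: str) -> list[str]:
    """Group the code points of a Devanagari word into syllable clusters."""
    clusters: list[str] = []
    for ch in word:
        # Vowel signs, virama, anusvara etc. are combining marks (Mn/Mc).
        is_mark = unicodedata.category(ch).startswith("M")
        # A consonant right after a virama belongs to the same conjunct.
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "hindI": ha + vowel sign i, then na + virama + da + vowel sign ii
hindi = "\u0939\u093F\u0928\u094D\u0926\u0940"
parts = split_syllables(hindi)
```

Each resulting cluster could then be split further into consonants, half-forms and modifiers, which is where a full grammar would be needed.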

2.1 Challenges and Solutions for Indic DL

Our initial goal for Indic DL has been to enable semi-automatic truthing² of Indic
documents, create public domain OCR data sets, design tools for word and
sub-word querying, and provide multi-lingual annotations. Challenges encountered
in the creation of these resources are outlined in this section.

² Truthing refers to the collection of ground-truth data for training or testing
learning algorithms.

Table 2. Top: Hierarchical splitting of characters into glyphs and components.
Bottom: Separation of a few Devanagari words. Compound characters are formed by
combining one or more consonant(s) with a vowel modifier. Some characters have
strokes on the top, called ascenders, and others have strokes at the bottom,
called descenders

Text Encoding: Though Indic scripts have a small alphabet set, each language
has a multitude of character forms, resulting in complex shapes on digital
output. A number of independent bodies have devised encoding schemes, with
each providing a set of codes for the alphabet set, and a corresponding set of
parsing rules to create character forms from sequences of alphabet codes [11,
21]. Our system uses Unicode³, which provides a hexadecimal encoding scheme
for all major scripts. Unicode has rules to generate inflectional characters with
variations like compound forms, ascenders, descenders etc. Alternate schemes do
not have the compatibility and global acceptance of Unicode [1, 8, 19, 22].
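
To make the encoding concrete, the short example below lists the Unicode code points behind the compound character kya. It uses only Python's standard `unicodedata` module; the choice of kya as the sample is ours.

```python
import unicodedata

# kya is stored as three code points: ka + virama + ya.
kya = "\u0915\u094D\u092F"
names = [unicodedata.name(ch) for ch in kya]
for ch, name in zip(kya, names):
    print(f"U+{ord(ch):04X}  {name}")
```

The rendered conjunct shape is produced by the font and text layout engine from this sequence; the stored text contains only the alphabet codes.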

³ Unicode is an official implementation of ISO/IEC 10646.

Multi-lingual Text Input: Not all modified Indic character forms can be
accommodated on a QWERTY keyboard: typing Indic script data necessitates a
combination of keys to enter the entire repertoire of characters. The INSCRIPT
standard [21] maps keys of a QWERTY keyboard onto Indic alphabets. Control
keys (alt, shift) are used to input corresponding modified forms. Alphabets are
grouped in phonetic order, with the keyboard layout designed for enhanced
speed. Another technique, transliteration, represents characters of an inflective
script by phonetically equivalent Latin strings. For example, the Devanagari
(Telugu) character क (క) would be entered using the Latin equivalent ka
(Figure 4.1.a). Both systems rely on phonetic characteristics of Indic scripts.
Barring a few exceptions, different Indic scripts have similar INSCRIPT layouts
and transliteration schemes. Tests conducted using 14,225 words from 34
documents reveal that, on average, the INSCRIPT layout needs 13% fewer
key presses than transliteration. However, INSCRIPT keyboards have to be
physically labeled before a user can easily associate keys with relevant characters.
Typing multi-lingual documents on an unlabelled keyboard entails a steep
learning curve. We use the transliteration scheme⁴ for multi-lingual data input.
A GUI keyboard is also provided to enter special characters (Figure 4.1.b).
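
The idea of transliteration input can be sketched with a toy converter from Latin strings to Devanagari. The tables below cover only a handful of letters and are assumptions made for illustration; they are not the ITRANS tables, and independent vowels, numerals and special symbols are left out.

```python
# Toy tables: a few consonants and dependent vowel signs only.
CONSONANTS = {"k": "\u0915", "y": "\u092F", "m": "\u092E", "r": "\u0930"}
VOWEL_SIGNS = {"aa": "\u093E", "i": "\u093F", "a": ""}  # 'a' is the inherent vowel
VIRAMA = "\u094D"


def transliterate(latin: str) -> str:
    """Convert a Latin string such as 'kya' into Devanagari code points."""
    out = []
    i = 0
    pending_consonant = False  # True while the last consonant has no vowel yet
    while i < len(latin):
        for key in ("aa", "i", "a"):  # try the longest vowel key first
            if pending_consonant and latin.startswith(key, i):
                out.append(VOWEL_SIGNS[key])
                pending_consonant = False
                i += len(key)
                break
        else:
            ch = latin[i]
            if ch not in CONSONANTS:
                raise ValueError(f"unsupported input: {ch!r}")
            if pending_consonant:
                out.append(VIRAMA)  # consonant cluster: suppress inherent vowel
            out.append(CONSONANTS[ch])
            pending_consonant = True
            i += 1
    if pending_consonant:
        out.append(VIRAMA)  # word-final bare consonant
    return "".join(out)
```

With these tables, `transliterate("ka")` gives the single letter ka, while `transliterate("kya")` gives ka + virama + ya, so the three Latin key presses for kya map onto three stored code points.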

Platform Independent Representation: Indic script data is stored in
XML, a versatile and preferred scheme for DL projects [1,2,8]. The platform
independent representation allows other researchers to reuse our data, and
design in-house parsers for information extraction. The implementation provides
converters to generate HTML content from XML truth files.
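
A minimal sketch of such a representation is given below, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`document`, `word`, `bbox`, `image`) and the file name are hypothetical; the actual schema of the truth files is not given in the text.

```python
import xml.etree.ElementTree as ET


def make_truth(image_path, words):
    """Build a truth record: one <word> per (text, bounding box) pair."""
    root = ET.Element("document", {"image": image_path})
    for text, (x0, y0, x1, y1) in words:
        word = ET.SubElement(root, "word", {"bbox": f"{x0},{y0},{x1},{y1}"})
        word.text = text
    return root


def truth_to_html(root):
    """Convert a truth record into a simple HTML paragraph."""
    body = ET.Element("p")
    for word in root.iter("word"):
        span = ET.SubElement(body, "span", {"title": word.get("bbox")})
        span.text = word.text
        span.tail = " "
    return ET.tostring(body, encoding="unicode")


root = make_truth("page1.tif", [("\u0915\u094D\u092F", (10, 20, 60, 48))])
xml_text = ET.tostring(root, encoding="unicode")
html_text = truth_to_html(root)
```

Because the record is plain XML holding Unicode text, any standard parser can read it without knowledge of the tool that produced it.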

Multi-lingual Text Display, Annotation and Meta-data Addition:
Unicode fonts and Java style attributes enable multi-lingual text/annotation
display within the same text window (Figure 4.1.c). The system provides controls
to select sections of transliteration text and specify their target script.
Extensions to the XML Document Type Definition (DTD) provide the flexibility to
include additional meta-data, such as author information and genre, in truth
files. The current implementation supports Arabic, Devanagari, English and Telugu.

Image Processing: Given that images are an important information source
for DL systems, we incorporate data extraction from gray-scale documents. The
system supports text-image separation, word boundary estimation and OCR
of Devanagari documents [14]. Applications for image analysis and querying are
described in later sections of this paper.

3 DL Design 

DL projects often reveal a combination of UI design, file systems and image 
processing tools. Kanungo et al. integrate page segmentation analysis [18] with 
a visualization toolkit in TrueViz [16]. Allen et al. [2] use a combination of 
OCR and meta-data extraction to generate XML representations of newspapers. 
ATLAS [7] outlines an annotation graph model for linguistic material (text, 
image and video). Couasnon et al. use an automated web-based system for 
collecting annotations of French archives [12]. 

These tools are oriented for use in a specific domain (OCR, Document Layout,
Newspaper DL), or are targeted towards non-inflective scripts similar to Latin. In
our DL, the features outlined in Section 2.1 are implemented as modular solutions
independent of target applications. This section outlines the grouping of these
features into three modules that are being used as building blocks in different
applications.

⁴ We use ITRANS transliteration. URL: http://www.aczone.com/itrans/

UI, Image and File Processing (UIP): The UIP contains routines that
process image and truth files, convert files from one format to another, and create
the UI. In effect, this module encapsulates the Platform Independent
Representation and Image Processing operations of the DL.

Text Entry and Display (TexED): The TexED controls the Text Encoding,
Multi-lingual Text Input and Multi-lingual Display mechanisms outlined in
Section 2.1. TexED's functional mapping can be shown as:

$$f_{\mathrm{TexED}}(\text{transliteration\_text},\ \text{GUI\_keyboard\_input},\ \text{language}) \rightarrow (\text{transliteration\_text},\ \text{Indic\_script},\ \text{system\_messages})$$

The domain (transliteration text, GUI keyboard input, and target Indic script)
and output (transliteration text, target Indic script generated by parsing input
transliteration, user prompts) of this mapping are defined by TexED. It is possible
to design multiple applications interfaced with a common TexED. For example,
the OCR engine and annotation tool described in Section 4 use the same TexED
for providing text input and display. The TexED currently supports Devanagari,
Telugu and English.

Text Processing (TexP): Routines in the TexP unit are used to test linguistic 
post-processing algorithms. Currently, character and word n-gram calculation is 
implemented in the TexP unit. 
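
Character and word n-gram counting of the kind mentioned above can be sketched in a few lines. The function names are ours, and counting over raw code points (rather than over syllables or components) is an assumption made to keep the example short.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(words, n):
    """Count overlapping word n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


bigrams = char_ngrams("banana", 2)
pairs = word_ngrams("a b a b".split(), 2)
```

For Indic text, the same functions could be applied to a list of syllables instead of a raw string, so that an n-gram never splits a compound character.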

4 Applications 

Several applications have been developed using the UIP, TexED and TexP 
outlined in section 3. We continuously upgrade the DL design based on research 
inputs and have listed a few immediate concerns in this section. 

4.1 Data Collection for Inflective Scripts 

This architecture supports data collection for Indic script OCR [15]. Indic OCR
necessitates splitting words into smaller parts, though opinions vary on the
level of detail (Table 2 shows different levels of separation). Statistical
techniques [5,14,17] extract representative feature vectors from character
images and apply pattern classification techniques on these features. To reduce
the number of classes, structural information is used to split each character
into smaller components, leading to higher granularity. Structural techniques
[9, 10] classify characters on the basis of geometric attributes, relying on
coarse characteristics at first, and classifying sub-groups using detail
features. In this case, any loss of structural information is avoided by
retaining the entire character (leading to coarse granularity). Providing data
sets with different levels of character splitting allows robust testing of
different OCRs: structural classifiers could use words and characters, which
have not been segmented into smaller parts, whereas statistical techniques may
be tested with alphabet and component images.

Fig. 2. Parsing a word on query: Extracting component truth (left), Extracting
component boundaries (right), Combining the two results to obtain component images
(bottom)

Our data sets are available as XML files containing links to the document
image (300 dpi, gray-scale, tiff files), bounding-box and Unicode values of
words, and linguistic data. A Devanagari query system has been provided
to extract character images at multiple levels of granularity (Figure 3.2).
The query system parses each Unicode word according to the script writing
grammar (Table 2), identifies each component, and stores them as component
truth. Character segmentation algorithms [14] are used to compute the list of
component boundaries in each word image. Component truth and component
boundaries are combined (Figure 2) to obtain results of each query. The
query system allows users to prune results, extract features, store the images
(query results) into a directory, or pass extracted components to a recognizer.
Recognition results are displayed on the query panel using color codes.
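
The combination of component truth and component boundaries can be pictured as pairing two lists. The sketch below is an assumption about how such a step might look: the function name, the (left, right) boundary format and the rule of returning `None` when the two lists disagree in length are all illustrative choices, not details taken from the system.

```python
def component_images(components, boundaries):
    """Pair each component label with its (left, right) column bounds.

    Returns None when the segmenter found a different number of
    components than the grammar did, so the word can be flagged.
    """
    if len(components) != len(boundaries):
        return None
    return list(zip(components, boundaries))


paired = component_images(["ka", "ya"], [(0, 14), (14, 31)])
flagged = component_images(["ka", "ya"], [(0, 31)])
```

A length mismatch is one simple signal that either the truth or the segmentation of a word needs to be inspected.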

To ensure data accuracy, we use two verification steps: we examine the truth
value and OCR segmentation results of each word in a single interface, and
flag truthing or segmentation errors (Figure 3.1). On average, each truth file
contains 445 words and needs 22 minutes for verification. Creation of initial truth
is more expensive and involves the typing of transliteration text with occasional
correction of word boundaries (average 69 minutes for a 400 word document).
Verification of 1800 words revealed 2 truthing errors (about 0.1%). Words
flagged in the verification process are observed in a tree view structure to identify
reasons for segmentation failure and correct truthing errors, if any. Currently, we
have around 50,000 machine printed Devanagari word images and plan to add
another 150,000 words. We are also gathering handwritten Arabic documents
for a database of handwritten words. The data and associated querying tools
are available for free download from our web-site⁵.

4.2 OCR Analysis 

The Devanagari query system is integrated with a tree view for OCR analysis
(Figure 3.2, 3.3). Errors in query output indicate possible failures of the
segmentation algorithm. If reasons for an error are not apparent, the tree
structure is examined for a detailed view of word segmentation. This combination
of a tree view and a query system allows users to test both segmentation and
recognition results. Provision is being made to plug in and test user-defined
segmentation and recognition algorithms.

⁵ URL: http://www.cedar.buffalo.edu/ilt

Fig. 3. 1: Segmentation and truth evaluation; (a): Word truth and segmentation
results, (b): Document view. 2: Query panel; (a): Query results, (b,c):
Transliteration and display for query input, (d): Query selection panel.
3: Script analysis window; (a): Breakup of image into components, (b): Breakup
of image transliteration and truth, (c): Tree view of document

Fig. 4. 1: Truthing and annotation interface; (a): Transliteration input, (b): GUI
keyboard, (c): Annotation and script words, (d): Image with word boundaries, (e):
Tool-bar for language selection, (f): Document truth. 2: XML truth of a document.
3: Annotations represented in HTML

4.3 Annotation 

Using the TexED and UIP modules, we have designed an interface to annotate
passages of text in one language with meaning/description in other languages.
The UI allows users to input multi-lingual data with a common tool and
an unmodified QWERTY keyboard. Text and annotations are rendered in
appropriate scripts, providing users prompt and natural feedback. Annotated
documents can be saved as HTML with annotations being displayed when the
mouse hovers over relevant text (Figure 4.3).

Annotation features of the tool are being used by the LiTgloss project® 
at the University at Buffalo. The project provides a collection of literary or 
cultural masterpieces, written in languages other than English, and annotated 
to facilitate comprehension by English-speaking readers. 

5 Discussion and Future Work 

This work examines prevalent text entry and display techniques and outlines an
optimal use of technology to support Indic script DLs. We provide multi-granular
data sets for Devanagari OCR analysis. Indic scripts have similar script writing
grammars, and efforts are under way to design OCR evaluation and querying
systems for Telugu. Given the inflective nature of Indic scripts, it is likely
that certain OCR errors (recognition) affect the “readability” of an output
document to a lesser degree than segmentation errors. Suitable error metrics for
these variations are being explored.

In future versions of our data set, we plan to include information such as
document skew and document degradation parameters. Analyzing artificial
degradation algorithms used in Latin script documents [4] and developing similar
techniques for inflective scripts is also being considered.

Abstract. The decreasing cost and the increasing availability of new
technologies are enabling people to create their own digital libraries. One
of the main topics in personal digital libraries is allowing people to select
interesting information among all the different digital formats available
today (pdf, html, tiff, etc.). Moreover, the increasing availability of these
on-line libraries, as well as the advent of the so-called Semantic Web
[1], is raising the demand for converting paper documents into digital,
possibly semantically annotated, documents. These motivations drove
us to design a new system which could enable the user to interact with and
query documents independently of the digital formats in which they
are represented. In order to achieve this independence from the format,
we consider all the digital documents contained in a digital library as
images. Our system tries to automatically detect the layout of the digital
documents and recognize the geometric regions of interest. All the
extracted information is then encoded with respect to a reference ontology,
so that the user can query his digital library by typing free text or
browsing the ontology.



1 Introduction 

The main goal of our system, OntoDoc, is to allow users to query their own
personal digital libraries in an ontology-based fashion. An ontology [2] specifies
a shared understanding of a domain of interest. It contains a set of concepts,
together with their definitions and interrelationships, and possibly encodes a
logical layer for inference and reasoning. Ontologies play a major role in the
context of the so-called Semantic Web [1], Tim Berners-Lee’s vision of the
next-generation Web, by enabling semantic awareness for online content.

OntoDoc uses a reference ontology to represent a conceptual model of the 
digital library domain, distinguishing between text, image and graph regions of 
a document, providing attribute relations for them, like size, orientation, color, 
etc. 

In order to classify a document, OntoDoc performs a first layout analysis
phase, generating a structured, conceptual model from a generic document. Then
the conceptual model goes through an indexing phase based on the features
in the model itself. Finally, the user can query his digital library by typing
free text or through composition of semantic expressions. The query system
is particularly suited to express perceptual aspects of intermediate/high-level
features of visual content, because the user does not have to bother thinking in
terms of inches, RGB components, pixels, etc. Instead, the user can query the
system with higher level, although accurate, concepts (e.g. medium size, black
color, horizontal orientation etc.).

In Section 2 we explain the layout segmentation phase. In Section 3 the
reference ontology is detailed. In Section 4 the query system, as well as an
example, is presented. In the final section we discuss conclusions and future work.




Fig. 1. The System window, with a matched digital document on the right. 



2 Layout Analysis Architecture Overview 

Our system [3] uses a split-and-merge technique similar to Nagy’s X-Y cut
algorithm, but instead of working top-down, we use the recognized horizontal
and vertical lines to cut the image into small regions and then try to merge
them into bigger regions, using a quad-tree technique and image processing
algorithms. The system is organized in these distinct phases in order to
perform document classification in a modular and efficient way. The system
architecture includes four main components: the preprocessor (1), the split
module (2), the merge module (3) and the classification module (4). All these
modules can be grouped in a Layout Analyzer module, which takes a document
image as input and outputs a structured, conceptual description of the regions
contained in the document. The “brain” of our system is the classification
submodule of the Layout Analyzer module, which outputs the structured model of
the digital document, i.e. segmented regions together with their attributes
encoded in an ontological format.

2.1 Preprocessing 

The pre-processing phase performs two steps: it loads the scanned image in main
memory and computes the gray-level histogram, extracting the three parameters
discussed below. This approach reduces the amount of computation in the
preprocessing phase and gives a quick response. Starting from a 256 gray-level
document, we compute the RI1 parameter as the maximum value in the first half
of the gray-level histogram. The RI2 parameter is computed in the same way on
the other half of the histogram, thus considering the lighter gray levels.
Finally, the last parameter, RI3, which represents the point of separation
between background and text, is the minimum between the RI1 and RI2 parameters.
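
The three histogram parameters can be sketched as below. The text leaves some room for interpretation, so this sketch makes its reading explicit as an assumption: RI1 and RI2 are taken to be the gray levels at which the histogram peaks in the dark half (0-127) and the light half (128-255), and RI3 is taken to be the gray level with the lowest count lying between those two peaks. The function name is hypothetical.

```python
def histogram_parameters(pixels):
    """Return (ri1, ri2, ri3) for an iterable of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # Peak position in the dark half of the histogram.
    ri1 = max(range(0, 128), key=hist.__getitem__)
    # Peak position in the light half of the histogram.
    ri2 = max(range(128, 256), key=hist.__getitem__)
    # Assumed reading of RI3: the valley between the two peaks.
    ri3 = min(range(ri1, ri2 + 1), key=hist.__getitem__)
    return ri1, ri2, ri3


# Synthetic page: dark text around level 20, light paper around level 230.
sample = [20] * 50 + [100] * 3 + [230] * 200
ri1, ri2, ri3 = histogram_parameters(sample)
```

A single pass over the pixels is enough to obtain all three values, which fits the stated aim of keeping preprocessing cheap.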

2.2 Split 

During the split phase, the whole image of the document is split according to
the extracted vertical and horizontal lines as well as the boundaries of
recognized images [4, 11]. This results in many small zones (block sizes are
within a range depending on the size of the image). We use a quad-tree
decomposition to perform spatial segmentation by assigning a condition by which
nodes are split, which is based on the average block sizes. To decide whether
or not to split a region, we use the mean and variance values for that region.
If the variance is low compared to the entire document, the region is probably
an image, because both characters and graphs have a high variance since they
usually do not have smooth colors (i.e. the foreground color is very different
from the background). After this step, labels are assigned to regions in order
to pre-classify them. This pre-classification is useful to pass information to
the merge phase. We define three classes of regions: text-graph, image and
background. In case of low variance and low mean values we label the region as
background; if we have high variance and low mean we label the region as
text-graph; otherwise the label is image.
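
The labelling rule at the end of the split phase can be written down directly. In the sketch below the two thresholds are free parameters chosen by the caller; the text does not say how "low" and "high" are decided, so the thresholds, the function name and the sample values are assumptions.

```python
from statistics import mean, pvariance


def preclassify(region, mean_threshold, variance_threshold):
    """Label a region (a flat list of gray values) as in the split phase."""
    m = mean(region)
    v = pvariance(region)
    low_mean = m < mean_threshold
    low_variance = v < variance_threshold
    if low_variance and low_mean:
        return "background"
    if not low_variance and low_mean:
        return "text-graph"
    return "image"


flat = preclassify([10, 10, 10, 10], 128, 100)
strokes = preclassify([0, 255, 0, 0], 128, 100)
photo = preclassify([200, 200, 210, 190], 128, 100)
```

Only two statistics per region are needed, so the label can be assigned while the quad-tree is being built.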

2.3 Merge 

The split operation results in a heavily over-segmented image. The goal of the
merge operation is to join neighboring zones to form bigger rectangular zones.
The first phase of merging consists of connecting neighboring regions with the
same pre-classification value. Pre-classification alone does not give all the
information we need, but this approach serves one of the targets of our
method, computational efficiency. The second step of the merging phase is the
Union phase, which is used to enhance the pre-classification results. First,
the regions that lie on the external edges of the document are removed; all
the other regions are considered for the next phase. We then group the
Macro-Regions, defined as those regions with a spanned area greater than a
threshold based on the average region size. All adjacent Macro-Regions with
the same pre-classification value are merged, thus obtaining our segmentation.
We then introduce $0 \le P(C_i \mid M) \le 1$ as the estimate of the conditional
probability that a given Macro-Region $M$ belongs to the class $C_i$, where
$|C| = 3$ and $C = \{\text{Text}, \text{Graph}, \text{Image}\}$. Let $|M| = m$ be
the total number of subregions of $M$, and $m_T$ the number of subregions of
$M$ with pre-classification {text-graph} and variance higher than the average
variance over all the subregions of $M$ with pre-classification {text-graph};
let $m_G$ be the number of subregions of $M$ with pre-classification
{text-graph} and variance lower than the average variance over all the
subregions of $M$ with pre-classification {text-graph}. Let $m_I$ be the number
of subregions of $M$ with pre-classification {image}. Then

$$P(C_0 \mid M) = \frac{m_T}{m}, \qquad P(C_1 \mid M) = \frac{m_G}{m}, \qquad P(C_2 \mid M) = \frac{m_I}{m},$$

which are respectively the probabilities that a Macro-Region $M$ belongs to the
class text, graph or image. Each Macro-Region is labeled with the class of
highest probability as defined above. After that, the system produces an OWL
(Web Ontology Language¹) description of the Macro-Regions, which maps the input
digital document to our conceptual model for digital libraries, i.e. the
reference ontology illustrated in the next section. At the moment, the mapping
is manually defined, but the application of a method based on
semantically-aware features is in progress.
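
The labelling of a Macro-Region can be sketched as follows, under the assumption that each class probability is the corresponding subregion count divided by the total number of subregions m. The function name, the input format (a list of label and variance pairs) and the treatment of subregions whose variance equals the average (counted as graph here) are illustrative choices.

```python
from statistics import mean


def classify_macro_region(subregions):
    """subregions: list of (pre_classification, variance) pairs."""
    m = len(subregions)
    if m == 0:
        raise ValueError("a Macro-Region needs at least one subregion")
    tg_variances = [v for label, v in subregions if label == "text-graph"]
    avg = mean(tg_variances) if tg_variances else 0.0
    m_t = sum(1 for v in tg_variances if v > avg)   # high-variance text-graph
    m_g = sum(1 for v in tg_variances if v <= avg)  # low-variance text-graph
    m_i = sum(1 for label, _ in subregions if label == "image")
    probs = {"Text": m_t / m, "Graph": m_g / m, "Image": m_i / m}
    best = max(probs, key=probs.get)
    return best, probs


region = [
    ("text-graph", 900.0),
    ("text-graph", 800.0),
    ("text-graph", 100.0),
    ("image", 50.0),
]
label, probs = classify_macro_region(region)
```

In this example two of the four subregions are high-variance text-graph blocks, so the Macro-Region is labeled Text with probability 0.5.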

The structured model obtained after the classification will contain different 
instances of region elements depending on the classification results. This infor- 
mation will be formatted in OWL. An example of an OWL file produced by our 
system follows: 

    <?xml version="1.0"?>
    <!-- OWL snippet -->
    <Text_Region rdf:ID="text_region1">
      <font-size>
        <Medium_Size rdf:ID="medium_size1"/>
      </font-size>
      <has-orientation>
        <Horizontal_Orientation rdf:ID="horiz_orient1"/>
      </has-orientation>
      <text-color>
        <Black rdf:ID="black1"/>
      </text-color>
    </Text_Region>



3 The Reference Ontology 

The reference ontology encodes all the required information about the digital li- 
brary domain. Documents and Regions are represented as resources with a num- 
ber of attributes in common, e.g. Orientation, Size, etc. The ontology encodes 
specific relations between subconcepts of Region, i.e. Text, Image and Graph 
Regions, and several attribute concepts, like Color, Size, Orientation, Reading 
Direction and so on (a snapshot of the concept taxonomy is shown in Figure 2). 

Encoding ontological concepts instead of numerical attributes is a peculiar
feature of our system. The user can think of concepts instead of low level
measures (e.g. inches, pixels etc.) and submit queries like “all the documents
with a text region in the south part and an image region with 2 columns having
an inner table”. The query is then translated into OWL and matched against the
document base. The user benefits from the use of a reference ontology in that
the annotated document base can be put online and queried by other users
referring to the same or another ontology if a mapping is provided. In order to
facilitate this task, we mapped each concept of our ontology to WordNet [5], a
de facto standard lexicalized ontology containing more than 120,000 concepts.
WordNet encodes each concept as a set of English synonyms, called synset. This
allows our system to accept free text queries and automatically map each
keyword to a concept in the reference ontology. For the Italian language we used
MultiWordNet [6], an Italian version of WordNet.

¹ http://www.w3.org/TR/owl-features/



    Entity
      Attribute
        Color
        Size
        Orientation
          Horizontal Orientation
          Vertical Orientation
        Reading Direction
          Left To Right
          Right To Left
        Language
      Resource
        Document
        Region
          Text Region
          Graphic Table Region
          Image Region


Fig. 2. A portion of the reference ontology for digital libraries. 



4 Querying a Personal Digital Library 

OntoDoc allows users to query their own digital libraries either by composing se- 
mantic expressions through ontology browsing or by typing some text in natural 
language. 

The first option allows the user to browse the ontology and instantiate
concepts of kind Resource, i.e. Document, Text Region, Graph Region and Image
Region. For each instance (left frame in the left window of Figure 1) the user can
fill the relation range slot with attribute instances, namely size, colors, reading
direction, language, orientation etc. (right frame in the left window of Figure 1).
Notice that these are not quantities, but instances of concepts, so for instance
the user will not choose the number of points for a font size, but he/she will
instantiate the Medium Size concept².

² This raises the problem of subjective or multicultural concepts. As an ontology is a
formal specification of an agreed conceptualization within a community, a user who
does not completely agree can redefine the meaning of one or more concepts (e.g.,
the user can change his/her meaning of “medium size” or an adaptive system may
be built based on the user feedback).

The second option is an interface for building semantic queries through
keyword typing. The interpreter either maps each keyword to a lexical item in a
WordNet synset or marks it as unknown. In the first case, the interpreter
translates the WordNet synset to the associated concept in the reference
ontology; otherwise it discards the keyword (see Figure 3 for an example). The
user can also connect keywords with logic connectives like or, and, etc.
Finally, the system instantiates each concept and fills in the gaps,
formulating a semantic query.




Fig. 3. The mapping steps from a keyword to a WordNet synset to a reference ontology 
concept. 



For instance, consider the query shown in Figure 3: “picture region with
two columns AND text region in English”. The interpreter assigns a synset in
WordNet to each meaningful word, obtaining the following string: “picture#1
region#1 with two#1 columns#3 AND text#1 region#1 in English#1” (the
number of the correct sense in WordNet is attached to each word). Then, the
interpreter translates each synset to a reference ontology concept, obtaining:
“Image Region with 2 has-columns AND Text Region in English” (concepts and
relations are marked in italic). Finally, concepts are instantiated and relations
are associated with these instances. The resulting query is: image-region#1,
text-region#1, has-columns(image-region#1, 2) and language(text-region#1,
English). Notice that the result of query composition, whether by ontology
browsing or by natural language input, is a semantic expression encoded in OWL
and used by the system to query the document base and return the matching
documents.
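
The mapping steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two dictionaries are small hand-made stand-ins for WordNet and for the reference ontology (OntoDoc itself is not shown in this paper), and the concept names and the function name `interpret` are hypothetical.

```python
# Hypothetical stand-in for WordNet: keyword -> (lemma, sense number).
# The sense numbers follow the worked example in the text.
SYNSETS = {
    "picture": ("picture", 1),
    "region": ("region", 1),
    "two": ("two", 1),
    "columns": ("column", 3),
    "text": ("text", 1),
    "english": ("English", 1),
}

# Hypothetical stand-in for the reference ontology:
# (lemma, sense) -> concept or relation name (assumed names).
CONCEPTS = {
    ("picture", 1): "Image",
    ("region", 1): "Region",
    ("two", 1): "2",
    ("column", 3): "has-columns",
    ("text", 1): "Text",
    ("English", 1): "English",
}

CONNECTIVES = {"and", "or"}


def interpret(query):
    """Map each keyword to a concept, keep connectives, discard unknown words."""
    result = []
    for word in query.split():
        key = word.lower()
        if key in CONNECTIVES:
            result.append(key.upper())
        elif key in SYNSETS:
            result.append(CONCEPTS[SYNSETS[key]])
        # Unknown keywords (here "with" and "in") are discarded.
    return result


print(interpret("picture region with two columns AND text region in English"))
```

The sketch stops at the concept sequence; instantiating the concepts and attaching relations to the instances, as in the final query of the example, would be a further step.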

5 Experimental Results and Conclusions 

We tested our system on the UW-II database, the second in a series
of document image databases produced by the Intelligent Systems Laboratory
at the University of Washington, Seattle, Washington, USA. This database was
particularly useful because it contains 624 English journal document pages (43
complete articles) and 63 MEMO pages. All pages are scanned pages.

Each document in the database has been taken from scientific journals and 
contains text, graphs and images. All the images were already annotated with 






labels for the region type (image, text, ...) and sizes. The experiments were
carried out in two phases: the first phase tested our layout analysis module
on the UW database to measure the accuracy of the automatic classification
of document regions, while the second phase involved 10 users and measured
the ability of the system to help them retrieve documents.

For the first experiment, concerning the classification abilities of our system,
we tested it on the entire database (600 images), obtaining 84% correctly
recognized regions, 14% incorrectly recognized regions and 2% still to be defined.
Of course, errors at this stage affect the querying precision; a method to
overcome this problem is the subject of ongoing work. The 84% of correctly
recognized regions can be subdivided into 59% entirely recognized and 25%
partially recognized, meaning that some regions were assigned to the right class
and others were not; for example, a single text region was interpreted as two
text regions (this usually happens in titles with many spaces).

For the second phase, all tests were carried out using the relevance
feedback process, by which the user analyzes the responses of the system and
indicates, for each item retrieved, a degree of relevance/non-relevance or the
exactness of the ranking [7]. Annotated results are then fed back into the system
to refine the query so that new results fit better. The experiment was carried
out by showing the users 10 different documents and then asking them to
retrieve those documents from the entire UW database using our query module.
On the qualitative side, our system proved to be highly effective because the
users concentrated on the conceptual content of documents rather than on
numerical information about them, allowing faster and more accurate retrieval of
the desired documents compared with keyword-based non-ontological retrieval.

A major improvement of OntoDoc may be in the classification phase. In fact, 
the system could classify shapes like subject images or specific geometry on the 
basis of their ontological descriptions (for instance, finding a document with an 
image of an apple, or with a pie-chart). 

Furthermore, mapping the reference ontology to WordNet could allow the system
to make inferences such as “a bitmap is a kind of image” and thus accept bitmap
region as input instead of image region. In the next version, we plan to include
an inference system based on the rules described in [8]. This will improve the
expressiveness of the natural language interpreter described in Section 4.

Finally, we plan to use OntoLearn [9], a tool for ontology learning, to enrich 
the reference ontology with new concepts and relations extracted from a corpus of 
documents like the ones used for the ICDAR 2003 [10] page layout competition. 
Such a corpus will also be used to extend our experiments.

Abstract. A personalized digital library that provides users with the
proper information just in time (JIT) is proposed. This digital library
supports not only the document retrieval task but also users’ reading
and comprehension tasks. This function is realized by recording
person-dependent information about human behavior toward
documents, such as highlighting, marking, and annotating. By using
a digital pen and paper printed with a dot pattern, handwritten annotations
on paper documents can be simultaneously digitized and stored in the
corresponding e-document. Furthermore, a hybridization of the real world
and the e-world for digitizing personal annotations is proposed. This
information on users’ annotations makes it possible to retrieve documents
based on implicit memory and to provide information support appropriate to
the user’s situation. The personalized digital library proposed in this
paper is still to be developed.



1 Introduction 

Why do we search for documents in Digital Library (DL) systems? What do
we do after retrieving them? In general, we access DL systems in
order to acquire information that is new to us or more detailed than what we
have. We read the retrieved documents and try to comprehend their contents.
After that, the documents are interpreted in each user’s own manner and
memorized as knowledge. In other words, our knowledge acquisition
process consists of (1) searching or retrieving, (2) reading and comprehending,
and (3) organizing and storing.

Many of the DL systems currently available support the search and
retrieval of documents. Users of DL systems can retrieve documents relevant to
their needs by providing keywords to the DL system, and browse the retrieved
documents by referring to document clustering results or summaries of documents [1].
The result of document retrieval is, however, usually independent of the user,
and the organization of the library is usually rigid.

On the other hand, current DL systems do not seem to support users’ reading,
comprehension and knowledge organization tasks. This is because human knowledge
acquisition depends on the person and on the situation. A summary of a document,
for example, differs from person to person.

The target of this research is to build a personalized digital library, in
other words a personal information environment (PIE), which can support the
reading, comprehension, and knowledge organization tasks by extending human
memory. This personalized digital library stores not only documents but also
human behavior toward documents, such as how problems are solved and which
part of a document is focused on in the current situation. By using this kind of
information, the system can provide more appropriate information or suggestions
in advance, such as which kind of query should be input; that is, substantial
information “just-in-time.”

As a first step toward the personal digital library, in this paper we focus
on annotations on documents, and consider a method of digitizing
annotations in order to record human actions on documents.

The following sections first describe the personalization of a digital library
as a means of extending human memory, and then the hybridization of the real
world and the e-world for making personal annotations computer-accessible as
human actions on documents and for providing information JIT.

2 Personalized Digital Library 
as Human Memory Extension 

There are two possible situations in accessing documents in a DL: the first is
general document retrieval based on the similarity between a query and
documents in the document vector space; the second is subjective retrieval
based on a user’s memory, which is not only explicit but also implicit. The
precision of document retrieval by the user is sometimes higher in the second
situation than in the first, because the user’s memory directly expresses how the
user interpreted the documents and which documents he/she associated with one
another in some specific situation, which corresponds straightforwardly
to the user’s needs in a similar situation.

Personalization of a DL is regarded as storing user-dependent or
situation-dependent information in addition to the documents themselves. This
amounts to human memory extension, in that human memory is expressed more
explicitly on the computer.

What we do to comprehend documents can be described as follows. In order
to remember the documents more deeply, we often make paper copies to read,
because paper is easy to glance over and suitable for careful reading [2].
Furthermore, we (1) highlight passages with underscores and marks, (2) make
annotations in the margins, (3) bookmark pages, (4) take notes, (5) write summaries,
(6) refer to other information sources (documents, web pages), (7) file the document
into a categorized folder, and (8) rearrange the hierarchical folder structure.

Thus, annotation on paper documents (including underscores or marks) is one
of the key clues for revealing what the user thought and how the user interpreted
the document in the reading and comprehension process. Therefore, storing
information about annotations as metadata of documents in the library is effective
for user-dependent document retrieval and for providing information JIT. To sum
up, digitizing handwritten annotations on paper documents is an essential function
in the personalized digital library proposed in this paper.



Fig. 1. An example of the personal annotation by Digital Pen



Fig. 2. Integration of real-world annotation with e-world annotation



3 Hybrid of Real and e-World
for Digitizing Personal Annotation

To realize the digitization of personal annotations, we adopted the digital pen [3].
This device generates pen-track data by sensing dotted patterns pre-printed
on paper. By assigning a unique dot pattern to each sheet of paper, the digital
pen can identify on which sheet, and where on that sheet, an annotation was made.
Therefore annotated strokes and/or texts can be associated with the words or
sentences originally printed on the paper.
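
As a rough illustration of this association step, the following Python sketch looks up the page from a dot-pattern identifier and returns the printed words whose bounding boxes overlap a pen stroke. Everything here is an assumption made for illustration: the registry and layout tables, the identifiers, and the bounding-box overlap rule are hypothetical and do not describe the actual digital pen software.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical registry: dot-pattern id -> (document id, page number).
PATTERN_REGISTRY = {"pattern-0001": ("doc-42", 1)}

# Hypothetical layout: (document id, page) -> printed words with bounding boxes.
PAGE_WORDS = {
    ("doc-42", 1): [
        Word("digital", 10, 10, 60, 20),
        Word("library", 65, 10, 115, 20),
        Word("system", 10, 30, 55, 40),
    ]
}


def words_under_stroke(pattern_id, points):
    """Return the page and the words whose boxes overlap the stroke's bounding box."""
    page = PATTERN_REGISTRY[pattern_id]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    sx0, sx1, sy0, sy1 = min(xs), max(xs), min(ys), max(ys)
    hits = [
        w.text
        for w in PAGE_WORDS[page]
        if w.x0 <= sx1 and sx0 <= w.x1 and w.y0 <= sy1 and sy0 <= w.y1
    ]
    return page, hits


# A highlighting stroke drawn through the second word of the first line.
print(words_under_stroke("pattern-0001", [(70, 15), (110, 16)]))
```

A real system would need a more tolerant rule (for example, underlines drawn just below a word), which this sketch does not attempt.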

Fig. 1 shows an example of a personal annotation. This annotation represents
the correction of a character string. In this example, some characters in boxes are
replaced by the characters written above the boxes. As a result of interpreting
the annotation, the character deletion is detected, the deleted characters
are identified, and the remaining characters, including the inserted characters,
are collected and recognized correctly.

In this way, handwritten annotations are interpreted and simultaneously
stored in the personalized digital library.

On the other hand, annotations are not only handwritten but also typed. As
described before, when referring to another web site, or associating other document
files with the current one, it is convenient to make annotations electronically on
a computer. Furthermore, an annotated URL string written with the digital pen
should be linked to the specified web page on the computer. Such a web page could
then be printed for careful reading, and handwritten annotations could be made on
the paper.

Thus, in order to record users’ reading and comprehension process naturally,
annotations both on paper and on the e-document should be digitized. As shown in
Fig. 2, every e-document can be printed as a paper document at any time. Once
handwritten annotations are made (on Paper 1), they are simultaneously digitized
and a new version of the e-document is generated, while the old e-document is
labeled as expired. When annotations are made on the computer, a new version of
the e-document is also generated, and at the same time the old paper document
(Paper 1) should be flagged as invalid. This is realized by the paper and
dot-pattern management software. If several paper documents are printed from
a single e-document, a separate e-document is prepared for each by duplicating the
original e-document file.
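
The versioning rules in this paragraph can be sketched as a small data model. This is a minimal sketch, not the paper and dot-pattern management software itself: the class and method names are hypothetical, and the choice to keep a paper copy valid (and re-link it to the new version) after a handwritten annotation is an assumption that the text implies but does not state.

```python
from dataclasses import dataclass, field


@dataclass
class EDocument:
    doc_id: str
    version: int = 1
    annotations: list = field(default_factory=list)
    expired: bool = False


@dataclass
class PaperCopy:
    pattern_id: str
    source: EDocument
    valid: bool = True


class Archive:
    """Keeps every e-document version and every paper copy printed from one."""

    def __init__(self):
        self.history = []  # all e-document versions, oldest first
        self.papers = []   # all printed copies

    def add(self, doc):
        self.history.append(doc)
        return doc

    def print_copy(self, doc, pattern_id):
        paper = PaperCopy(pattern_id, doc)
        self.papers.append(paper)
        return paper

    def _new_version(self, old, note):
        old.expired = True
        return self.add(EDocument(old.doc_id, old.version + 1, old.annotations + [note]))

    def annotate_on_paper(self, paper, note):
        # Handwritten note: assumed to leave the paper valid and tied to the new version.
        paper.source = self._new_version(paper.source, note)
        return paper.source

    def annotate_on_screen(self, doc, note):
        # Typed note: paper copies of the old version no longer match, so flag them.
        for paper in self.papers:
            if paper.source is doc:
                paper.valid = False
        return self._new_version(doc, note)
```

For example, printing a document, underlining a passage on paper, and then adding a typed link on screen yields three versions in `history`, with the first two expired and the paper copy flagged as invalid.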

In this way, personal annotations can be digitized and the history of the 
annotations is stored as metadata of the document in the library (in Fig. 2, we 
call it “Archiving Document”) for later use, that is, for providing appropriate 
information JIT. 

4 Conclusion 

We proposed a personal information environment featuring the combination
of the e-world and the real world. Retrieved documents can be printed, highlighted,
marked, and annotated. All the added information is simultaneously stored in
the computer for later use. The personal digital library we would like to construct
will also feature reading assistance, content analysis, and intelligent knowledge
management by utilizing the information on human actions in the reading and
comprehension process. We are going to implement a prototype system according
to this proposal.

Abstract. Despite the current practice of re-keying most documents placed in
digital libraries, we continue to try to improve the accuracy of automated recognition
techniques for obtaining document image content. This task is made more difficult
when the document in question has been rendered in letterpress, subjected to
hundreds of years of aging, and microfilmed before scanning.
We endeavored to leave intact a previously described document reconstruction
technique, and to enhance the document image to bring the perceived production
values up to a more modern standard, in order to process a novel of historic
importance: Don Quixote by Miguel de Cervantes Saavedra. Pre-processing of the
page images before application of the reconstruction techniques was performed
to accommodate early 17th century typography and low-quality scanned
microfilm images.

Though our technology easily outstripped the capabilities of commercial OCRs, 
it too was found lacking, at this stage of development, for automated processing 
of historical documents for digital libraries. 

We had hoped to develop a useful transcription of the text and a lexicon of Span- 
ish contemporary with the composition of this novel. However the actual accom- 
plishment was limited to making improvements in the recognizability of the page 
images involved and providing a basis for further research. 



1 Introduction 

While it has been shown by Taghva, et al. [13], and by Ittner, et al. [3], that significant 
applications such as text retrieval and topic identification can be accomplished even in 
the presence of OCR errors, digital libraries require a much higher level of accuracy in 
terms of faithful rendition of the source text. 

The process of document reconstruction (RECON) is continuously evolving. In 
1995, Reynar, Spitz and Sibun [5] described a process of reconstructing a document 
from its image without resorting to Optical Character Recognition (OCR), leaving am- 
biguities to be resolved by a downstream process or by a human reader. Later, Spitz [6] 
described an unusual, lexically-driven word recognition engine based on the reconstruc- 
tion and resolving ambiguities based on selective matching of character bitmaps. Both 
of these processes rely on a robust transformation of character images into Character 
Shape Codes (CSCs). It is this robustness, described by Spitz and Marks [9], that led
to the application of shape coding and reconstruction as a technique for handling poor
quality images.

Nagy, et al. [4] and Ho and Nagy [2] took a different approach, relying on similarity
of bitmaps and assigning character identities based on statistics of occurrence as well
as lexical matching.

An enhancement of the document reconstruction technique was described by Spitz 
[10] and draws from all three of the previously cited processes, thus providing an inte- 
gration of information from multiple sources: word shape, character bitmaps and lexical 
information. Implicit non-lexical language model information is dynamically inferred 
based on the character shape distributions. 

Because of its inability to deal with non-lexical tokens such as number strings and
punctuation, document reconstruction can never be the sole basis for conversion of
documents for use in digital libraries. However, for largely textual documents such as
novels, document reconstruction might make a significant contribution to the faithful
rendering of the text of the document in the digital domain.

Cannon et al. were able to significantly improve OCR performance by pre-processing 
the image [1], after making evaluations of which image processing steps were likely to 
help in each particular document image. 

We have obtained 8 digitizations, from microfilm, of the novel Don Quixote per- 
formed by the Biblioteca Nacional, Spanish Royal Academy, Oxford, Harvard and Yale 
Universities, the British Library, and two copies from the Hispanic Society of America. 

At least two separate printings are represented: 1605 and 1608. However even 
within a single printing there are variations caused by changes in the printing process. 

In this paper we will use as examples the 6 pages (from page 15 verso to page 18 
recto) that comprise Chapter 5. 

In Section 2 of this paper we will describe in some detail the challenge in terms 
of production values and image quality. Section 3 will show the results of processing 
these difficult images using commercial OCRs. Starting in Section 4 we will describe 
the special purpose image processing that was required in order to be able to reconstruct 
this document. 

In Section 5 we will describe the character shape coding process and the word shape 
encoding resulting from the agglomeration of these CSCs. Section 6 describes the struc- 
ture and contents of two lexica used in the reconstruction process. In Section 7 we 
will describe the collection of character bitmaps and the construction of lists of similar 
bitmaps. 

Section 8 will describe the initial labeling of these lists with character codes. Section 
9 describes the progressive reduction of ambiguity on a word-by-word basis. We present 
preliminary accuracy results in Section 10 and present some conclusions in Section 11. 

2 Typography 

There are several challenges in the production values used in the printing of this book. 

Chapters start with an illuminated drop capital character followed by a capital- 
ized second character. Figure 1 shows the rendering of the initial word of Chapter 5 
“VIendo”. 

There appear to be no intentional ligatures in the text such as fl, but the image
degradation was severe enough that many character pairs touched, and in the italic
chapter and page headings long runs of touching characters were not uncommon.



Fig. 1. Chapters start with an illuminated drop cap (in this case: V) followed by a capitalized 
second letter (in this case: I). 



The t in the font used has a very small ascender. It is not unusual for t ascenders to 
be shorter than those for b, d, h, k etc., but in this font it is very difficult to consistently 
classify ts as either ascender or x-height characters. 

Hyphenation is inconsistent. The text was set fully justified, but a word broken
across lines was hyphenated in some instances and not in others. See, for example,
Figure 2. It was hard to divine the rules, if there were any, for when hyphenation
was used.




Fig. 2. Note that the last word on the first line “menearse” is not hyphenated while the last word 
on the next line is hyphenated. 

There were multiple examples of words that were broken across pages. In these
instances they were hyphenated and the terminal part of the word appeared right
justified on the last line of the page as well as at the top of the following page.




Fig. 3. The word “pregunto” broken across pages. The top panel shows the bottom of Page 15 
Verso, and the bottom panel shows the top of Page 16 Recto. Note that “gunto” appears redun- 
dantly. Also note the deep kerning of the initial letter of “Quixote” in the second line of 16R. 



There are no paragraph indents or vertical spacing, and no quotation marks. 

3 Conventional OCR 

We submitted partially processed page images to two commercial OCR packages. These
images had been segmented to isolate the text from the page edges and had the italics
and illuminated drop caps removed. These same images were used as input to the OCRs
and the RECON process.

The OCRs used were Omnipage Professional 9.0 and Finereader Pro 6.0. The character
and word accuracy rates are shown in Table 1. Clearly Omnipage inserted many
bogus characters while Finereader deleted a significant number of legitimate character
images. In any case the word accuracy rates are sufficiently low to render these data
useless for a digital library (or any other) application.

Table 1. Character and word accuracy for two OCRs.

             Number of    Character      Number of   Word
             Characters   Accuracy (%)   Words       Accuracy (%)
Truth         8800        -              1598        -
Omnipage     15212        40.6           3228        0.12
Finereader    6330        28.5            774        1.2


4 Pre-recognition Processing 

The images supplied to us were scanned from microfilm with two facing pages per frame.
It was necessary to develop document-specific techniques for segmenting the
individual pages and for reducing the noise level in the document images. These
techniques turned out to be fairly simple. First a connected component analysis was
performed, and then all components too tall to be text were removed. This had the
effect of removing the black areas outside of the page frames and the illuminated drop
caps as well, but since we had no technology to deal with drop caps anyway, this was
considered to be appropriate for the application. The page was deskewed based on a
single angle for the whole page, and projection profiles of the remaining connected
components determined the boundaries of the text frame.
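
The component filtering and projection-profile steps can be sketched in pure Python on a small binary image. This is an illustrative sketch under assumptions, not the code used for the experiments: ink pixels are taken to be 1, components are 8-connected, the height threshold `max_height` is a free parameter, and deskewing is omitted.

```python
from collections import deque


def connected_components(img):
    """Label 8-connected ink pixels (value 1); return one pixel list per component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps


def remove_tall_components(img, max_height):
    """Erase every component whose bounding box is taller than max_height pixels."""
    out = [row[:] for row in img]
    for pixels in connected_components(img):
        rows = [y for y, _ in pixels]
        if max(rows) - min(rows) + 1 > max_height:
            for y, x in pixels:
                out[y][x] = 0
    return out


def text_frame(img):
    """First/last inked row and column, read off the two projection profiles."""
    row_profile = [sum(row) for row in img]
    col_profile = [sum(col) for col in zip(*img)]
    rows = [i for i, v in enumerate(row_profile) if v > 0]
    cols = [i for i, v in enumerate(col_profile) if v > 0]
    return (rows[0], rows[-1], cols[0], cols[-1]) if rows and cols else None


# A tall black bar at the left edge plus one small text-sized component.
page = [
    [1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
cleaned = remove_tall_components(page, max_height=3)
print(text_frame(cleaned))
```

After the tall bar is removed, the text frame shrinks to the rows and columns occupied by the small component.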

The image processing specifically developed for this project took the form of a
pre-processing filter on the input image. This was done in order to obviate the need
to adapt the reconstruction processor to the aberrant typography.

The spatial resolution varied such that there was a range of approximately 14 to
45 pixels in the x-height. This results in a font height of 30 to 100 pixels. The image
quality by modern standards was poor, due to the degradation inherent in microfilming
and subsequent scanning, the paper and print quality and the fact that the book had been
subjected to nearly 400 years of aging. We noted that the best perceived image quality
was not, in this instance, associated with the highest spatial resolution.

4.1 Noise Removal 

There are multiple sources of noise in the images we analyzed, the two principal ones 
being the speckle created by the printing process and those inherent to microfilming. 
We dealt with speckle in much the same way as Cannon, et al. described [1].

There is considerable print-through on some pages.

Table 2. Resolution characteristics of the 8 scans of the Don Quixote text.

Source   x-height (pixels)   font height (pixels)   dpi at 12 pt
bd       19                  48                     288
bm       27                  60                     360
bn       22                  48                     288
hs1      45                  100                    600
hs2      26                  58                     348
hu       14                  30                     180
ra       21                  43                     258
yi       14                  32                     192

4.2 Line Straightening 

All of the images are of bound volumes resulting in significant warp of some text lines 
as they near the gutter. In addition, apparently the pages were often distorted by applica- 
tion of a clear platen, resulting in a distortion that can be characterized as continuously 
variable skew as a function of position on the page. The variable skew was removed first 
using techniques described in Spitz [12]. The line warp was removed in the process of 
character shape coding described in [10]. 



4.3 Word Spacing 

Word spacing is highly variable and in many instances is so tight as to make it hard
to distinguish word spacing from letter spacing. The use of a lexicon-based recognizer
requires that we be able to accurately delimit word images. A partial solution to this 
problem was effected by applying a non-linear expansion of the inter-character spacing 
to make small increments somewhat larger. 

However, unlike current practice, the text was set without a word space following
a comma. Here it was necessary to explicitly insert a word space after the shape
coding process had detected a comma.

4.4 Noise Cancellation 

An attempt was made to take advantage of the multiple images available of the same
pages. It was hoped that noise would drop out when the images were scaled and
superimposed. We first cropped the page frames to the text and scaled them to the
same dimensions. We then multiplied the images together.
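
The multiplication step can be sketched as a pixel-wise product of two binary images. The sketch assumes that the images have already been cropped and scaled to the same size and that ink is encoded as 1 and background as 0, so that a pixel survives only if it is inked in both scans; the polarity and the function name are assumptions, not details given in the text.

```python
def superimpose(img_a, img_b):
    """Pixel-wise product of two equally sized binary images (1 = ink, 0 = background)."""
    same_shape = len(img_a) == len(img_b) and all(
        len(a) == len(b) for a, b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must be cropped and scaled to the same dimensions")
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(img_a, img_b)]


# Two scans of the same vertical stroke, each with its own speckle.
scan_a = [[1, 1, 0], [0, 1, 0], [0, 1, 1]]
scan_b = [[0, 1, 0], [1, 1, 0], [0, 1, 0]]
print(superimpose(scan_a, scan_b))
```

Speckle that appears in only one scan is cancelled, but so is any genuine ink that is not in registration, which is the failure mode reported for the center of the page.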

The result obtained from superimposing the two best images of page 16 recto is 
shown in Figure 4. Note that though the images are very closely superimposed at both 
the top and bottom of the page, the vertical center region of the page is not in regis- 
tration. The reason for this result is unknown but is likely to be due to differential and 
non-linear paper shrinkage though it might also be due to non-affine optical distortion 
in the microfilming process.



Fig. 4. Result of attempt to register two images of the same page showing good registration at 
the top and bottom, poor registration in the middle due to non-affine distortions in one or both 
images. 



Kerning, especially on characters like Q, leads to deep overlaps in character
bounding boxes. Baselines wander. An attempt was made to correct all of these image
characteristics by enforcing constant intra-word spacing (thereby dekerning character
pairs), straightening baselines, inserting word spaces after a comma and inserting
leading to keep descenders and ascenders from tangling. The results of this processing
are shown in Figure 5.

5 Character Shape Coding 

The CSCs encode whether or not the character in question fits between the baseline
and the x-line or if not, whether it has an ascender or descender and the number and
spatial distribution of the connected components. For a more complete discussion of this
process see [6]. Character shape coding has been revised several times and selection of
the most appropriate version is a function of the application being served and the fonts
in use as well as the range of image quality to be processed. Because of the poor image
quality available, we used Version 0, the most basic, and therefore most robust, version.



Fig. 5. Separating image elements for more accurate word detection. Note that the i and ? dots
have been removed in the noise reduction process.

Table 3. Definitions of (Version 0) Character Shape Codes for alphabetical characters.

Characters      CSC   Definition
A-Zbdfhkl       A     ascender
acemnorsuvxwz   x     between baseline and x-line
?gpqy           g     descender
ino             i     x-height plus mark above
j               j     descender plus mark above

There are a few CSCs, not shown in this table, that represent punctuation. 

CSCs are used internally within the document reconstruction process and their
word-level aggregations* into Word Shape Tokens (WSTs) are used as indices in
lexica.

The results of CSC processing are shown in Figure 6. 



leal. Y de<s>ta manera fue pro<s>iguiendo el romance, ha<s>

[scanned image of the printed line]

AxxA. A AxAAx xxxxxx Axx xxAigxixxAx xA xxxxxxx, AxA

Fig. 6. Truth text, printed and scanned representation and the result of the character shape coding 
process. Note <s> shown for long s in truth. 
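
To make the coding concrete, the following Python sketch maps ideal (noise-free) text to Version 0 shape codes using the classes of Table 3. It is an illustration only: it works on character strings rather than on images, it treats t and the long s as ascenders because that is how they are coded in Figure 6 (an assumption, since neither appears in the table as extracted), it omits the accented characters, and it passes punctuation through unchanged instead of using the punctuation CSCs that the table does not show.

```python
ASCENDERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZbdfhkl")
X_HEIGHT = set("acemnorsuvxwz")
DESCENDERS = set("gpqy")
LONG_S = "\u017f"  # the long s, written <s> in the truth text of Fig. 6


def shape_code(ch):
    """Return the Version 0 CSC for one character (simplified, see lead-in)."""
    if ch in ASCENDERS or ch in ("t", LONG_S):  # t and long s: assumed from Fig. 6
        return "A"
    if ch in X_HEIGHT:
        return "x"
    if ch in DESCENDERS:
        return "g"
    if ch in ("i", "j"):
        return ch
    return ch  # punctuation and anything else is passed through unchanged


def word_shape_token(word):
    """Aggregate the CSCs of a word into its Word Shape Token (WST)."""
    return "".join(shape_code(ch) for ch in word)


for word in ("leal", "de" + LONG_S + "ta", "manera", "fue", "el", "romance,"):
    print(word, word_shape_token(word))
```

For these words the tokens agree with the shape-coded line in Figure 6.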



6 The Lexicon 

6.1 Structure 

A RECON lexicon is simply a series of word lists, each indexed by a common WST. For 
a more complete discussion of lexical structure see [6]. Though in operation RECON 
is not totally reliant on the correctness of the WST, its performance is highly dependent 
on this transformation. In many instances there may be a single surface form word 
that maps to the WST thereby defining all of the character positions simultaneously. 
However even if there are multiple words in the lexical entry, there are often some 
character positions where the character identity is unambiguous. 

* Leading and trailing punctuation marks are ignored.

The ambiguity, or lack of ambiguity, can be represented in a number of ways. Of
course it can be represented by the word lists themselves or by two types of regular
expression. The degenerate regular expression (DRE) encodes ambiguity with alternate
characters enclosed in brackets []. The simplified regular expression (SRE) encodes all
character positions where there is ambiguity with a ?.
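
As an illustration of the two encodings, the following sketch derives a DRE and an SRE from the word list of a single lexical entry. The function names are hypothetical, and the alphabetical ordering of alternatives inside the brackets is a choice made here for reproducibility, not something the text specifies.

```python
def dre(words):
    """Degenerate regular expression: bracketed alternatives where the words differ."""
    parts = []
    for column in zip(*words):  # words sharing a WST all have the same length
        options = sorted(set(column))
        if len(options) == 1:
            parts.append(options[0])
        else:
            parts.append("[" + "".join(options) + "]")
    return "".join(parts)


def sre(words):
    """Simplified regular expression: '?' at every ambiguous position."""
    return "".join(
        column[0] if len(set(column)) == 1 else "?" for column in zip(*words)
    )


entry = ["estas", "estos"]  # the words listed under WST xxAxx in Table 4
print(dre(entry))  # est[ao]s
print(sre(entry))  # est?s
```

In this entry four of the five character positions are unambiguous even though the word itself is not yet decided.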

6.2 Contents 

Don Quixote was written around the turn of the 17th century using Spanish spellings,
some of which are not acceptable by today’s standards. For this reason, use of a modern
lexicon is inappropriate. Were it available, we would use a lexicon consistent with the
language and spelling prevalent at the time the book was written. However, we had no
access to such a lexicon.

As Spitz has shown [11], the performance of our reconstruction technique improves
with the use of the optimum lexicon: one with a large intersection with the words
in the document and with a minimal number of words not found in the document.

For this application we developed the lexicon iteratively, starting with a bootstrap
list of frequently occurring words: articles, pronouns and some proper nouns, shown in
Figure 7.



Dulzinea Mancha Mantua Marques Quixote Rozinante Valdouinos a
al como con cura de del don donde dos el ella ello ellos en es estas
este esto estos la las le lo los mas me mi muchas muy ni no o por
qual quando quatro que se senor senora si su tio todo y yo

Fig. 7. 54 common words in Don Quixote comprising the bootstrap lexicon 

This tiny lexicon contains 48.5% of the words found in Chapter 5. The shape coded
lexicon is shown in Table 4. Note that all of the uncapitalized words shown in Figure 7
are present in the lexicon in both uncapitalized and capitalized form. Thus the size of
the lexicon expands from 54 words to 101 surface forms.

These 101 surface forms are indexed by 44 WSTs, 20 of which are singleton repre- 
sentations. That is, 20 WSTs unambiguously define the lexical form that they encode. 
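
The counts above can be reproduced with a short sketch that expands the bootstrap list and indexes it by WST. The compact shape coder below treats t as an ascender, which is an assumption taken from the Table 4 examples, and the words mi and ni are read from the extracted figure text. Diacritics are absent from the extracted list, so a few tokens differ from those in Table 4 without changing the counts.

```python
from collections import defaultdict

BOOTSTRAP = """
Dulzinea Mancha Mantua Marques Quixote Rozinante Valdouinos a
al como con cura de del don donde dos el ella ello ellos en es estas
este esto estos la las le lo los mas me mi muchas muy ni no o por
qual quando quatro que se senor senora si su tio todo y yo
""".split()


def wst(word):
    """Compact Version 0 shape coder ('t' as ascender is an assumption)."""
    codes = []
    for ch in word:
        if ch.isupper() or ch in "bdfhklt":
            codes.append("A")
        elif ch in "gpqy":
            codes.append("g")
        elif ch in "ij":
            codes.append(ch)
        else:
            codes.append("x")
    return "".join(codes)


# Every uncapitalized word is also entered in capitalized form.
forms = set(BOOTSTRAP) | {word.capitalize() for word in BOOTSTRAP}

index = defaultdict(list)
for form in sorted(forms):
    index[wst(form)].append(form)

singletons = [token for token, words in index.items() if len(words) == 1]
print(len(BOOTSTRAP), len(forms), len(index), len(singletons))  # 54 101 44 20
```

A singleton entry such as `index["AxixxAx"]`, which holds only Quixote, defines every character position of the word as soon as its WST is observed.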

7 Character Bitmaps 

We process the line, word, character cell structure by comparing the bitmaps found in
each character cell with all extant lists of characters. Because it is very important that
a list contain only the bitmaps of a single identity character (see exception below), we
employ a very strict criterion for declaring a match between a candidate bitmap and
the bitmap characterizing the list. We either accept the candidate as a member of that
list or, if no match is found, start a new list.

When we have finished generating the pattern lists, we sort them so that the longest
list, the one with the greatest number of matching bitmaps, is first and the populations
of the lists monotonically decrease. We show in Figure 8 the first 10 instances of each
list. Note that at this stage we do not yet know the character identity for the lists. Also
note that there might be more than one list containing slightly different bitmaps of a
particular character due to the tight threshold for inclusion in an extant list.
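
The list-building procedure can be sketched as follows. The text does not name the bitmap comparison measure, so the pixel-wise Hamming distance and the threshold value used here are assumptions for illustration only, as are the function names and the tiny 3x3 bitmaps.

```python
def hamming(a, b):
    """Number of differing pixels between two equal-sized flattened bitmaps."""
    return sum(p != q for p, q in zip(a, b))


def build_lists(bitmaps, threshold):
    """Assign each bitmap to the first list whose head it matches, else start a list."""
    lists = []
    for bitmap in bitmaps:
        for members in lists:
            # members[0] is the bitmap characterizing the list
            if hamming(bitmap, members[0]) <= threshold:
                members.append(bitmap)
                break
        else:
            lists.append([bitmap])  # no match found: start a new list
    lists.sort(key=len, reverse=True)  # longest list first
    return lists


ring = (1, 1, 1,
        1, 0, 1,
        1, 1, 1)
worn_ring = (1, 1, 1,
             1, 0, 1,
             1, 1, 0)  # one pixel away from ring
bar = (0, 1, 0,
       0, 1, 0,
       0, 1, 0)

lists = build_lists([bar, ring, worn_ring, ring], threshold=1)
print([len(members) for members in lists])  # [3, 1]
```

With a threshold of 0 the worn bitmap would start a list of its own, which mirrors the duplication noted for Figure 8.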



[Figure 8 image: one row of character cell bitmaps per list; not reproduced here.]



Fig. 8. Character cell bitmap lists in order of decreasing frequency. Note duplication. 

The highly variable character morphology extant in these images, combined with
the strict matching criterion, resulted in many lists being generated for each intended
glyph. Since the long s, which looks much like an f without a full crossbar, was
indistinguishable from the f using the RECON bitmap comparison technique, we lumped
the long s and f together in the bitmap domain and resolved the ambiguity at the lexical
level. Examples of the two character forms are shown in Figure 9.



[Figure 9 image: two word bitmaps; not reproduced here.]

Fig. 9. Two words from the text. The word on the left contains an f. The word on the right contains 
a long s. 

From this point forward it is presumed that there are no lists containing bitmaps 
arising from different characters. 

8 Labeling Bitmap Lists 

Looking at the unambiguous character positions in the SREs we label the corresponding 
individual bitmap in the lists with the character code. If no mistakes were made in 
character shape coding and if the lexicon is adequate, all of the bitmap list entries will 
be labeled correctly (if at all). 

The lists themselves are now labelled if at least two individual bitmaps are consis- 
tently labeled and there are no inconsistent labels. If there are inconsistencies in the 
labeling, the list is assigned the label of the preponderance of bitmap labels and those 
bitmaps with the inconsistent labels are marked for further processing. Preponderance 
is defined as the number of consistent labels being greater than 4 times the number of 
inconsistent labels. 
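
The labeling rule for a list can be written down compactly. In this sketch the function name and the return convention are hypothetical, and leaving a list unlabeled when inconsistent labels are present without a preponderance is an assumption, since the text does not say what happens in that case.

```python
from collections import Counter


def label_list(bitmap_labels):
    """Decide the label of one bitmap list.

    bitmap_labels holds one entry per bitmap: a character, or None if the
    bitmap has not been labeled. Returns (list_label, flagged_indices), where
    flagged_indices are the bitmaps marked for further processing.
    """
    counts = Counter(label for label in bitmap_labels if label is not None)
    if not counts:
        return None, []
    top, consistent = counts.most_common(1)[0]
    inconsistent = sum(counts.values()) - consistent
    if inconsistent == 0:
        # at least two consistently labeled bitmaps are required
        return (top if consistent >= 2 else None), []
    if consistent > 4 * inconsistent:  # preponderance
        flagged = [i for i, label in enumerate(bitmap_labels)
                   if label is not None and label != top]
        return top, flagged
    return None, []  # assumption: no preponderance, so the list stays unlabeled


print(label_list(["e", "e", None]))       # ('e', [])
print(label_list(["e"] * 5 + ["c"]))      # ('e', [5])
print(label_list(["e"] * 4 + ["c"]))      # (None, [])
```

The third call shows the boundary of the rule: four consistent labels against one inconsistent label is not "greater than 4 times", so no label is assigned.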

Note again that there may be more than one list with a particular character label.
And it is possible that some lists, particularly short ones, will remain unlabeled. Lists
labeled with the same character are concatenated if any pair of characters, one from
each list, have a small enough difference. This allows for different degradations of the
character form to be handled, even if the characters on the heads of the two lists differ
by an amount that exceeds the threshold.

Table 4. WST indexed bootstrap lexicon (abbreviated for space)

WST           Words
A             A O Y
AAAx          Ella Ello
Aix           Tio tio
Ax            De En Es La Le Lo Me No Se Su Yo de la le lo
AxAAxxixxx    Valdouinos
AxAx          Este Esto Todo todo
AxAxixxx      Dulzinea
AxAxx         Estas Estos
Axixx         Senor
AxixxAx       Quixote
Axixxx        Senora
Axx           Con Don Dos Las Los Mas Por Que don dos las los
AxxAx         Donde donde
AxxAxx        Mantua Muchas Quatro
AxxixxxAx     Rozinante
AxxxAx        Mancha Quando
g             y
gxx           por que
x             a o
xxAx          este esto
xxAxx         estas estos
xxg           muy
xxxAxx        muchas

Unlabeled lists are likewise concatenated with labelled ones if a (nearly) matching 
pair of bitmaps is found. 

Once a list has been labeled all of the individual bitmaps on that list are likewise 
labeled. Doing this materially reduces ambiguity. 

Resolution of infrequently occurring characters such as capitals is difficult because 
their rarity makes it impossible to assemble long lists. 

9 Word Recognition 

It may be that the labeling of the pattern lists will result in all of the characters in a 
word also being labeled. If only some of the characters are labelled, the lexical entries 
for the WST are examined and those that are inconsistent with the defined characters are 
removed. Often this will result in characters becoming defined that have not yet been 
directly examined. From this new information it is possible to label some previously 
unlabeled pattern lists. 
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Each reduction of the lexical entry results in a re-calculation of the SRE and DRE. 

It may also happen that the application of the character identities will result in the 
removal of all of the elements of the lexical entry indicating either a CSC error or the 
presence of a word not in the lexicon. In either case the word is marked for character 
by character resolution without further recourse to the lexicon. 

Residual ambiguity is reflected in the DRE. In each character position, the bitmaps 
on the appropriately labelled lists for the legitimate alternative characters can be checked 
against the character cell bitmap. Each time character position ambiguity is resolved, 
the lexical entries are pruned. 
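
The pruning step can be sketched as a filter over the word list of a lexical entry. The function names and the position-to-character dictionary are hypothetical conveniences, not the RECON data structures.

```python
def prune(entry, known):
    """Keep the words consistent with the characters identified so far.

    known maps a character position to the character identified there.
    """
    return [word for word in entry
            if all(word[pos] == ch for pos, ch in known.items())]


def resolved_positions(entry):
    """Positions on which all remaining words agree (unambiguous in the SRE)."""
    return {pos: column[0]
            for pos, column in enumerate(zip(*entry))
            if len(set(column)) == 1}


entry = ["Este", "Esto", "Todo", "todo"]  # words listed under WST AxAx in Table 4
entry = prune(entry, {0: "E"})
print(entry)                      # ['Este', 'Esto']
print(resolved_positions(entry))  # {0: 'E', 1: 's', 2: 't'}
```

Identifying a single character here defines two further positions that were never examined directly. An empty result from `prune` corresponds to the case described above, a CSC error or a word missing from the lexicon.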

When a word is recognized as a result of resolving individual character forms, as
opposed to finding it through lexical look-up, that word and its associated WST are added
to the lexicon in order to enable future recognition of that word and to progressively
offer greater coverage (increased definition) of the document contents in the lexicon.



10 Accuracy 

This is an instance where it is very difficult to know the truth. The author was supplied 
with eight manually produced transcriptions for Chapter 5. These transcriptions differ 
in small but significant ways. For tests of character accuracy we used the transcription 
supplied with the Hispanic Society digitizations. 

Since RECON relies on a lexicon, its performance will be affected by the appropri- 
ateness of that lexicon to the document being processed. See Spitz [6] for a discussion 
of lexical intersection and specificity. 

Word accuracy using the document reconstruction technique with the bootstrap
lexicon was 11.4%. This abysmal result was only notable with respect to the even greater
failures exhibited by the commercial OCRs.



11 Conclusions 

This paper does not present scientific results of great significance. Indeed while it is 
still possible that the paradigm of pre-processing page images to suit the RECON pro- 
cess, and to post-process the character code information returned from RECON, will 
result in a useful transcription of the source text, this paper merely sets out some of the 
problems encountered when traditional document recognition techniques were applied 
to the difficult images of Don Quixote. 

We posit that application of these techniques to longer passages of text than the 
single chapter available to us at this time, would show that the progressive resolution 
paradigm used here would increase performance, both in speed and accuracy, the longer 
the text. 

We are wary however that errors introduced early in the recognition process might 
produce a reciprocal degradation in accuracy for some specific word shapes. 

Having multiple digitizations of the same pages has allowed us to explore
techniques to take advantage of the redundancy of information in those images: something
smarter than processing them all and taking the results from those that produce the
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“best” results. Since there are several variables between digitizations it might be possi- 
ble to overlay these images in an attempt to cancel out the noise in the images. Once that 
is done it would be tempting to apply the techniques of Xu and Nagy [14] to improve 
character bitmaps by statistical techniques rather than by selection. 
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Abstract. Recognition of old Greek manuscripts is essential for quick and effi- 
cient content exploitation of the valuable old Greek historical collections. In 
this paper, we focus on the problem of recognizing early Christian Greek manu- 
scripts written in lower case letters. Based on the existence of hole regions in 
the majority of characters and character ligatures in these scripts, we propose a 
novel, segmentation-free, fast and efficient technique that assists the recogni- 
tion procedure by tracing and recognizing the most frequently appearing char- 
acters or character ligatures. First, we detect hole regions that exist in the char- 
acter body. Then, the protrusions in the outer contour outline of the connected 
components that contain the character hole regions are used for the classifica- 
tion of the area around holes to a specific character or a character ligature. The 
proposed method gives highly accurate results and offers great assistance to old 
Greek handwritten manuscript OCR. 



1 Introduction 

Recognition of old Greek manuscripts is essential for quick and efficient content 
exploitation of the valuable old Greek historical collections. In this paper, we focus 
on early Christian Greek manuscripts written in lower case letters. The Sinaitic Codex
Number Three, which contains the Book of Job, constitutes the main axis of our
research (see Fig. 1a).

The field of handwritten character recognition has made great progress during the 
past years [1]. Many methods were developed in an attempt to satisfy the need for 
such systems that exists in various applications like automatic reading of postal ad- 
dresses and bank checks, form processing etc. [2,3]. For handwritten character recog- 
nition two main approaches can be identified: The global approach [4,5] and the seg- 
mentation approach [6,7]. The global approach entails the recognition of the whole 
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word while the segmentation approach requires that each word has to be segmented 
into letters. Although the global approach can be found in the literature as “segmenta- 
tion-free” approach, it involves a word detection task. Some approaches that do not 
involve any segmentation task are based on the concept and techniques of occluded 
object recognition [8,9]. According to those approaches, significant geometric fea- 
tures, such as short line segments, enclosed regions and corners, are extracted from 
the image of the entire page. 

Traditional techniques for handwritten recognition cannot be easily applied to
early Christian Greek manuscripts written in lower case letters, since these manuscripts
exhibit several unique characteristics, such as:

• Continuous connection between characters, even between different words.
• Largely standardized scripts, since they are immediate predecessors of early printed
  books.
• Character ligatures are very common.
• Hole regions appear in the majority of characters and character ligatures. As shown
  in Fig. 1b, hole regions appear in many letters and letter ligatures. These constitute
  60% of all characters of a typical old Greek manuscript.

[Figure 1 images: (a) manuscript page; (b) characters and ligatures with hole regions; not reproduced here.]



Fig. 1. (a) Early Christian Greek manuscript - The Sinaitic Codex Number Three; (b) Identified 
characters or character ligatures that contain hole regions. 



Due to the continuous connection between characters we developed a segmenta- 
tion-free recognition technique to assist an Old Greek Handwritten Manuscript OCR. 
Based on the existence of hole regions in the majority of characters and character 
ligatures, we propose a technique for tracing and recognition of characters that
contain holes. The proposed methodology does not aim at a complete character
recognition system for old Greek manuscripts. Rather, it aims to assist the recognition
procedure by tracing and recognizing the most frequently appearing characters or
character ligatures, using a segmentation-free, quick and efficient approach. The
method has been developed in the framework of the Greek GSRT-funded R&D pro- 
ject, D-SCRIBE, which aims to develop an integrated system for digitization and 
processing of old Greek manuscripts. 
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2 Methodology 

The proposed methodology consists of several distinct stages. Initially, we trace hole 
regions that exist in character bodies. We suggest a novel fast algorithm based on 
processing the white runs of the initial b/w image. This algorithm permits the extrac- 
tion of the character hole regions but rejects hole regions of larger dimension, such as 
holes inside frames, diagrams, etc. At the next stage of our approach, all character 
hole regions are initially grouped into several categories according to the distance 
between them. In this way, character hole regions are classified as: an isolated hole, 
two horizontal neighboring holes, three horizontal neighboring holes, etc. The final 
stage of our approach includes the final classification of hole regions or groups of 
hole regions into a character or a ligature according to the protrusions that exist on the 
outer contour outline of the connected components that contain the character hole 
regions. 



2.1 Character Hole Region Detection 

There exist several hole detection algorithms mainly based on contour following and 
distinguishing external and internal contours [10,11]. We suggest a novel fast algo- 
rithm for hole detection based on processing the white runs of the b/w image. Below 
follows a step-by-step description of the proposed algorithm. 

Step 1. All horizontal and vertical image white runs that neighbor with image borders
or are of length greater than L are flagged, where L is a characteristic length reflecting
character size.

Step 2. All horizontal and vertical white runs of not flagged pixels that neighbor with 
the flagged pixels of step 1 are flagged as well. 

Step 3. Repeat step 2 until no pixel remains to be flagged. 

Step 4. All remaining white runs of not flagged pixels belong to image holes. Possible 
holes with very small white run lengths are ignored. 
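
The four steps can be sketched in a few lines. This is a simplified reading of the algorithm rather than the authors' implementation: the image is a list of rows with 0 for white and 1 for black, runs are maximal white runs, and a run is flagged in Steps 2 and 3 as soon as any of its pixels has been flagged. The final rejection of very small holes in Step 4 is omitted, and all names are hypothetical.

```python
def white_runs(img):
    """Yield maximal horizontal and vertical runs of white (0) pixels."""
    height, width = len(img), len(img[0])
    lines = [[(r, c) for c in range(width)] for r in range(height)]    # rows
    lines += [[(r, c) for r in range(height)] for c in range(width)]   # columns
    for line in lines:
        run = []
        for r, c in line:
            if img[r][c] == 0:
                run.append((r, c))
            elif run:
                yield run
                run = []
        if run:
            yield run


def detect_holes(img, max_len):
    """Return the set of white pixels that belong to character-sized holes."""
    height, width = len(img), len(img[0])

    def on_border(pixel):
        r, c = pixel
        return r in (0, height - 1) or c in (0, width - 1)

    runs = list(white_runs(img))
    flagged = set()
    # Step 1: runs touching the image border or longer than L (max_len)
    for run in runs:
        if len(run) > max_len or any(on_border(p) for p in run):
            flagged.update(run)
    # Steps 2 and 3: propagate flags until nothing changes
    changed = True
    while changed:
        changed = False
        for run in runs:
            if any(p in flagged for p in run) and not all(p in flagged for p in run):
                flagged.update(run)
                changed = True
    # Step 4: the remaining white pixels belong to holes
    return {(r, c) for r in range(height) for c in range(width)
            if img[r][c] == 0 and (r, c) not in flagged}


img = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 0, 0],  # closed loop on the left, open shape on the right
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(detect_holes(img, max_len=3))  # {(2, 2)}
```

The white pixels inside the open shape are reached from the border and are therefore flagged, while the enclosed pixel of the closed loop survives as a hole.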

The proposed algorithm for hole detection extracts only the character hole regions 
and not other hole regions of larger dimension, with white run length greater than L, 
such as holes inside frames, diagrams etc. An example of the proposed hole detection 
algorithm is demonstrated in Fig. 2. 




Fig. 2. Hole detection algorithm demonstration: (a) Original image; (b)-(e) Resulting image
after 1, 2, 5 and 19 iterations, respectively.
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2.2 Hole Grouping 

Hole regions are grouped together according to their proximity and topology. Table 1 
shows the dictionary of the hole patterns, including their configuration and the corre- 
sponding character or character ligature. 



Table 1. Hole patterns dictionary.

Pattern ID       1                2               3         4          5      6
Pattern          o                o o             o o o     o o o o    o      o
                                                                       o      o o
Characters or    ο, ε, (P), δ     π, ω, (£0)      σπ, επ    απο        (0)    φ
character
ligatures



2.3 Character Detection 



Feature extraction is applied to characters that contain one or more holes. In order to
calculate these features, the connected component must first be split into characters
that contain one or more holes. The method we use for character isolation takes
advantage of the characteristics of the outer contour pixels of the component and
the contour pixels of the hole. The coordinates of these pixels can be extracted by
using a contour following algorithm [12]. Fig. 3 shows three components with 1 hole
each.

Let $H = \{(x_i^H, y_i^H),\ i \in [1,n]\}$ be the set of the pixel coordinates that compose the
contour of the hole and $C = \{(x_j^C, y_j^C),\ j \in [1,m]\}$ be the set of the pixel coordinates
belonging to the outer contour of the character.

We define the distance between the hole and the connected character component
as:

$$D(H,C) = \max_i \Big\{ \min_j \Big\{ \sqrt{(x_i^H - x_j^C)^2 + (y_i^H - y_j^C)^2} \Big\} \Big\} \qquad (1)$$

Calculating $D(H,C)$ and using the pixel coordinates which compose the contour of the
hole, it is easy to isolate the pixels of the connected character component contained in
a bounding box W with the following top-left $(x_{TL}, y_{TL})$ and bottom-right
$(x_{BR}, y_{BR})$ corner coordinates:

$$x_{TL} = \begin{cases} \min_j(x_j^C) & \text{if } \min_j(x_j^C) > \min_i(x_i^H) - 2 \cdot D(H,C) \\ \min_i(x_i^H) - 2 \cdot D(H,C) & \text{otherwise} \end{cases}$$

$$y_{TL} = \begin{cases} \min_j(y_j^C) & \text{if } \min_j(y_j^C) > \min_i(y_i^H) - 2 \cdot D(H,C) \\ \min_i(y_i^H) - 2 \cdot D(H,C) & \text{otherwise} \end{cases}$$

$$x_{BR} = \begin{cases} \max_j(x_j^C) & \text{if } \max_j(x_j^C) < \max_i(x_i^H) + 2 \cdot D(H,C) \\ \max_i(x_i^H) + 2 \cdot D(H,C) & \text{otherwise} \end{cases}$$

$$y_{BR} = \begin{cases} \max_j(y_j^C) & \text{if } \max_j(y_j^C) < \max_i(y_i^H) + 2 \cdot D(H,C) \\ \max_i(y_i^H) + 2 \cdot D(H,C) & \text{otherwise} \end{cases} \qquad (2)$$

Fig. 4 shows the contours of the component with the respective windows W 
around the holes. 
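
A small sketch may help to make the distance and the window concrete. It assumes that the window extends the extent of the hole by 2·D(H,C) on each side and is clipped to the extent of the outer contour, which is one reading of Eq. (2). The function names and the sample coordinates are invented for illustration.

```python
from math import hypot


def hole_distance(hole, contour):
    """Eq. (1): for every hole pixel take the distance to the nearest
    outer-contour pixel, then return the largest of these distances."""
    return max(min(hypot(xh - xc, yh - yc) for xc, yc in contour)
               for xh, yh in hole)


def window(hole, contour):
    """Window W: hole extent grown by 2*D(H,C), clipped to the contour extent
    (the clipping is an assumption about Eq. (2))."""
    d = hole_distance(hole, contour)
    hole_x, hole_y = zip(*hole)
    cont_x, cont_y = zip(*contour)
    top_left = (max(min(cont_x), min(hole_x) - 2 * d),
                max(min(cont_y), min(hole_y) - 2 * d))
    bottom_right = (min(max(cont_x), max(hole_x) + 2 * d),
                    min(max(cont_y), max(hole_y) + 2 * d))
    return top_left, bottom_right


hole = [(4, 4), (5, 4), (5, 5), (4, 5)]
# outer contour samples: a loop around the hole plus a long stroke to the right
contour = [(3, 3), (6, 3), (6, 6), (3, 6), (20, 6)]

print(hole_distance(hole, contour))
print(window(hole, contour))
```

In this example the window stops a little to the right of the loop, so the long connecting stroke at x = 20 is excluded from the isolated character.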




Fig. 3. An image of three independent components that contains a Greek word. 




Fig. 4. The contours of the component with the respective window W around the holes, 
a) Greek character “a”, b) Greek character “a”, c) Greek character “s”. 

2.4 Feature Estimation 

The features are computed in order to identify all segments that belong to a protrusion
as it is delineated by the outer contour outline of the isolated character. Feature
extraction is applied in two modes: the so-called vertical and horizontal modes. The
vertical mode is used to describe the protrusible segments that may exist on the top
and the bottom of the character, while the horizontal mode is used to describe the
protrusible segments that may exist on the right side of the character (see Fig. 4). The
feature set $F = (f_1, f_2, \ldots, f_9)$, which is composed of 9 features, expresses the probability of
a segment being part of a protrusion. Features $f_1, f_2, f_3$ describe the protrusible
segments that may appear on the top of the character, $f_4, f_5, f_6$ describe the protrusible
segments that may exist at the bottom of the character, and the last three features
describe the protrusible segments that may exist on the right side of the character. In our
approach we have not taken into account segments that may belong to left
protrusions, due to our observation that in all cases they correspond to a letter ligament
rather than the main body of a character. A sound example is given in Fig. 4a in the
case of Greek character “a”.
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Feature estimation is employed in the following steps: 

• Step 1: Bounding box division into blocks 

In vertical mode we compute the mean value $Y$ of all y-coordinates of the set H, which
is denoted as:

$$Y = \frac{1}{n} \sum_{i=1}^{n} y_i^H \qquad (3)$$

We divide W into three areas of equal width and assign a divide line $F(x) = Y$ as
shown in Fig. 5a, resulting in 6 blocks $R_1, \ldots, R_6$.

Furthermore, for the horizontal mode we compute the mean value $X$ of all
x-coordinates of the set H, which is denoted as:

$$X = \frac{1}{n} \sum_{i=1}^{n} x_i^H \qquad (4)$$

Similarly, we divide W into three areas of equal height and assign a divide line
$F(y) = X$ as shown in Fig. 6a, resulting in 6 extra blocks $R_7, \ldots, R_{12}$.

• Step 2: Block correction 

In order to describe the pixels of a protrusible segment, ignoring the other pixels of
the character, the process must be restricted to the upper side of blocks $R_i$ with
$i \in [1,3]$, to the lower side of blocks $R_i$ with $i \in [4,6]$ and to the rightmost side of
blocks $R_i$ with $i \in [7,9]$. Blocks $R_{10}, R_{11}, R_{12}$ are not used for feature extraction
because any protrusion in these regions is a letter ligature rather than a dominant part of
the character. Thus, for each block, we compute an offset ($P_i$) to the divide line at
the corresponding mode (see Fig. 5b, 6b). This leads to a new assignment of the block
area that is different for each block that initially partitioned the bounding box. The
corrected blocks will delimit the area for the block-based feature computation.

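The offset at the vertical mode ($P_{y_i}$) and the horizontal mode ($P_{x_i}$) is calculated
as follows:

$$P_{y_i} = \begin{cases} \min_j(y_j^{H_{R_i}}) - D(H_{R_i}, C) & (H_{R_i} \neq \text{null}) \wedge (i \in [1,3]) \\ \max_j(y_j^{H_{R_i}}) + D(H_{R_i}, C) & (H_{R_i} \neq \text{null}) \wedge (i \in [4,6]) \\ D(H,C) & H_{R_i} = \text{null} \end{cases}$$

$$P_{x_i} = \begin{cases} \max_j(x_j^{H_{R_i}}) + D(H_{R_i}, C) & (H_{R_i} \neq \text{null}) \wedge (i \in [7,9]) \\ D(H,C) & H_{R_i} = \text{null} \end{cases} \qquad (5)$$

where $H_{R_i} = \{(x_j^{H_{R_i}}, y_j^{H_{R_i}}) \subset H,\ j \in [1,n_i],\ i \in [1,9]\}$ are the sets of the pixel
coordinates which constitute the part of the hole contour depicted in block $R_i$.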

• Step 3: Block-based feature computation 

For this step, a crucial role is played by the directions of the contour pixels, evaluated
by considering pairs of adjacent pixels during a clockwise tracing of the outer
contour. Hence, for each pixel j of the contour we introduce $S_j$, the local orientation of
the contour, taking nominal values from the set {W, SW, S, SE, E, NE, N, NW}. Once
the directions are evaluated, the proposed feature $f_i$ is defined as follows:

$$f_i = \frac{1}{D(H,C)} \sum_{j=1}^{m_i} g_i(S_j) \qquad (6)$$

where $g_i(\cdot)$ is a function depending on the orientation of the pixel and the block
considered and $m_i$ is the total number of pixels of the outer contour inside block $SR_i$. The
term $D(H,C)$ is used as a “normalization” factor allowing for the feature to be
independent of the character scaling. The $g_i(\cdot)$’s are explicitly defined in Table 2. They
determine a pixel template unique for each block i that expresses the expected local
orientation of the corresponding contour pixels. During contour following, $g_i(\cdot)$
equals 1 when the examined pixel belongs to either a vertical or horizontal protrusible
segment, and otherwise it equals 0.
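
The block feature can be sketched as a count of accepted orientations scaled by the hole distance. The per-block templates of Table 2 are not available in this text, so the set of accepted orientations used below is purely hypothetical, as are the function names and the sample contour. Image coordinates are assumed, with y growing downwards.

```python
# Orientation of the step between two 8-connected pixels (y grows downwards).
DIRECTIONS = {
    (1, 0): "E", (1, 1): "SE", (0, 1): "S", (-1, 1): "SW",
    (-1, 0): "W", (-1, -1): "NW", (0, -1): "N", (1, -1): "NE",
}


def orientations(contour):
    """Local orientation for each consecutive pair of contour pixels."""
    return [DIRECTIONS[(x2 - x1, y2 - y1)]
            for (x1, y1), (x2, y2) in zip(contour, contour[1:])]


def feature(block_contour, accepted, d_hc):
    """Count the contour steps accepted by the block template, scaled by 1/D(H,C)."""
    hits = sum(s in accepted for s in orientations(block_contour))
    return hits / d_hc


# Outer-contour pixels inside one block, in tracing order (made-up data).
block_contour = [(0, 3), (0, 2), (0, 1), (1, 0), (2, 1), (2, 2)]
vertical_template = {"N", "S"}  # hypothetical stand-in for a Table 2 entry

print(orientations(block_contour))  # ['N', 'N', 'NE', 'SE', 'S']
print(feature(block_contour, vertical_template, d_hc=1.5))  # 2.0
```

Dividing by D(H,C) means that the same shape drawn at twice the size, with twice as many accepted steps and twice the distance, yields the same feature value.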

Extension of the above procedure for characters having more than one hole is
straightforward. Multiple holes are treated as one after a merging process which
preserves the initial topology.



Table 2. $g_i(\cdot)$ in 8-connectivity contour following.



Fig. 5. Vertical mode: a) Regions $R_1, R_2, \ldots, R_6$ defined by F(x); b) Sub-regions $SR_1$,
$SR_2, \ldots, SR_6$ defined by the offsets, respectively; c) The values of features $f_1, \ldots, f_6$ for the
segmented Greek letter “s”.
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(a) (b) (c) 



Fig. 6. Horizontal mode: a) Regions $R_7, R_8, R_9$ defined by F(y); b) Sub-regions $SR_7, SR_8, SR_9$
defined by the offsets, respectively; c) The values of features $f_7, f_8, f_9$ for the segmented
Greek letter “s”.



3 Experimental Results 

The purpose of the experiments was to test the classification performance of the 
handwritten manuscripts with respect to the proposed hole detection and feature ex- 
traction techniques. The overall experiments used samples coming from three differ- 
ent writers of the Book of Job collection, manually labeled with the correct answers. 

3.1 Hole Detection Evaluation 

The first series of experiments tested the performance of the hole pattern detection 
algorithm. Table 3 shows the dictionary of the hole patterns, including the number of 
pattern occurrences in the sample. Notice that the majority of characters are classified
as having one or two adjacent holes. In particular, patterns 5 and 6 each correspond
to exactly one character, which implies that detection of these patterns is
equivalent to identification of the corresponding characters.



Table 3. The dictionary of hole patterns including the number of pattern occurrences.

Pattern ID     1      2      3        4          5     6
Pattern        o      o o    o o o    o o o o    o     o
                                                 o     o o
Occurrences    786    130    7        3          30    11



The hole detection algorithm requires no training and thus the totality of the avail- 
able labeled sample was used as a test set. Table 4 summarizes the results obtained by 
applying the algorithm, showing the recall and the precision rates for each one of the 
hole patterns. As seen from Table 4, the performance on both recall and precision is 
satisfactory. Again, the reader should focus on the results concerning the first two 
patterns, which correspond to the majority of samples and have thus stronger statisti- 
cal significance. 
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Table 4. Recall / Precision for the characters or character ligatures in each of the hole patterns. 



ID    Recall    Precision
1     95.81     97.42
2     94.61     86.62
3     100       53.85
4     100       100
5     84.37     96.43
6     87.5      100

Overall weighted Recall/Precision: 95.24
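
The overall figure can be checked against the two tables. Weighting the per-pattern recall values of Table 4 by the occurrence counts of Table 3 reproduces it; that the weights are the occurrence counts is an assumption that this check supports for recall.

```python
occurrences = [786, 130, 7, 3, 30, 11]                # Table 3, patterns 1 to 6
recall = [95.81, 94.61, 100.0, 100.0, 84.37, 87.5]    # Table 4, patterns 1 to 6

weighted_recall = (sum(n * r for n, r in zip(occurrences, recall))
                   / sum(occurrences))
print(round(weighted_recall, 2))  # 95.24
```

Patterns 1 and 2 account for 916 of the 967 occurrences, which is why the overall value stays close to their individual recall rates.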



3.2 Character and Character Ligature Identification 

The second series of experiments test the performance of the character and character 
ligature identification algorithm. The experiments concentrate on characters and char- 
acter ligatures that correspond to patterns 1 and 2 (Table 1), since these patterns ap- 
pear in a great variety of characters. The experiment involved two steps: the feature 
vector extraction step and the training and testing of a statistical classifier. The focus 
of the experiments was on testing suitability of the extracted features, by measuring 
the classification performance of popular classification algorithms. 

To that end, characters and character ligatures coming from 3 different writers
(referenced as im1, im2 and im3, respectively; see Table 5) were gathered into two
different datasets, CL1 and CL2, according to their pattern (one or two holes). Within
each dataset, the feature extraction algorithm described in section 2.4 was applied to
each character or character ligature. Notice that in this step there was no need
to split the data into reference and test sets, since the feature extraction algorithm
requires no training. For the classification step, however, such a split is necessary in
order to measure the generalization performance of the trained classifiers. Thus, a
series of scenarios with various ways of splitting the data was constructed as the basis
for the experiments. Table 5 lists the scenarios for CL1 and CL2.



Table 5. Training set / Test set configuration for the CL1 and CL2 datasets.

ID       Samples   Training                Test
CL1-1    754       70 (10% im1, im2)       684 (90% im1, im2)
CL1-2    754       147 (20% im1, im2)      607 (80% im1, im2)
CL1-3    479       147 (20% im1, im2)      332 (100% im3)
CL1-4    402       70 (10% im1, im2)       332 (100% im3)
CL1-5    1086      754 (100% im1, im2)     332 (100% im3)
CL2-1    123       12 (10% im1, im2)       111 (90% im1, im2)
CL2-2    123       24 (20% im1, im2)       99 (80% im1, im2)
CL2-3    73        12 (10% im1, im2)       61 (100% im3)
CL2-4    85        24 (20% im1, im2)       61 (100% im3)
CL2-5    184       123 (100% im1, im2)     61 (100% im3)




72 Basilios Gatos et al. 



The classification step was performed using two well known classification
algorithms, K-NN [13] and SVM [13,14]. K-NN was used in two variants, with the L1 norm
and the L2 norm. Moreover, an exhaustive search took place in order to determine
the number of neighbors (k) that gave the best score. SVM was used
in conjunction with the RBF kernel, a popular, general-purpose yet powerful kernel.
Again, a grid search was performed in order to find the optimum values for both the
variance parameter of the RBF kernel (γ) and the cost parameter of SVM (c). The
results for CL1 and CL2, together with the optimal parameter values, are listed in
Table 6 and Table 7 respectively.
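
The K-NN part of this procedure can be sketched without any library. The code below is a generic illustration of nearest-neighbor voting with the L1 and L2 norms and an exhaustive search over k, not the software used in the experiments. The two-dimensional feature vectors and class names are made up, and breaking ties in favor of the smaller k is a choice made here.

```python
from collections import Counter


def l1(a, b):
    """City-block distance."""
    return sum(abs(p - q) for p, q in zip(a, b))


def l2(a, b):
    """Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def knn_predict(train, x, k, dist):
    """Majority vote among the k training samples nearest to x."""
    nearest = sorted(train, key=lambda item: dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def best_k(train, test, dist, k_values):
    """Exhaustive search: return (accuracy, k) for the best-scoring k."""
    scores = []
    for k in k_values:
        hits = sum(knn_predict(train, x, k, dist) == y for x, y in test)
        scores.append((hits / len(test), k))
    return max(scores, key=lambda s: (s[0], -s[1]))  # ties go to the smaller k


# Made-up feature vectors for two classes.
train = [((0.0, 0.0), "alpha"), ((0.1, 0.2), "alpha"),
         ((1.0, 1.0), "epsilon"), ((0.9, 1.1), "epsilon")]
test = [((0.2, 0.1), "alpha"), ((1.0, 0.8), "epsilon")]

print(best_k(train, test, l2, k_values=[1, 3]))  # (1.0, 1)
print(best_k(train, test, l1, k_values=[1, 3]))  # (1.0, 1)
```

Note that a parameter chosen by its score on the test set, as in this sketch, gives an optimistic estimate; a separate validation split would be needed for an unbiased one.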

Table 6. Performance of the algorithms for the CL1 dataset. Numbers in parentheses represent
the parameters used for achieving the optimum scores. The ID column corresponds to the
different scenarios as shown in Table 5. For the SVM kernel, the number of support vectors found is
also given.

ID       KNN-L1          KNN-L2          SVM-RBF
CL1-1    90.49 (k=1)     90.78 (k=1)     93.42 (γ=0.94, c=50, SVs=45)
CL1-2    94.06 (k=1)     93.73 (k=1)     95.05 (γ=0.98, c=50, SVs=62)
CL1-3    97.89 (k=2)     97.59 (k=2)     97.89 (γ=0.04, c=50, SVs=63)
CL1-4    94.27 (k=1)     96.08 (k=1)     97.28 (γ=0.2, c=100, SVs=42)
CL1-5    98.19 (k=10)    98.19 (k=4)     98.49 (γ=0.8, c=1, SVs=216)



Table 7. Performance of the algorithms for the CL2 dataset. Numbers in parentheses represent
the parameters used for achieving the optimum scores. The ID column corresponds to the
different scenarios as shown in Table 5.

ID       KNN-L1         KNN-L2         SVM-RBF
CL2-1    95.49 (k=1)    93.69 (k=1)    94.59 (γ=0.1, c=10, SVs=9)
CL2-2    93.93 (k=1)    92.92 (k=1)    94.94 (γ=0.1, c=20, SVs=10)
CL2-3    100.0 (k=1)    100.0 (k=1)    100.0 (γ=0.1, c=10, SVs=9)
CL2-4    98.36 (k=6)    95.99 (k=1)    100.0 (γ=0.1, c=10, SVs=11)
CL2-5    96.72 (k=1)    98.36 (k=1)    100.0 (γ=0.1, c=10, SVs=24)



The scores that were achieved in both datasets were very high, even in cases where
the samples were few. This particular aspect is very encouraging, since it indicates the
good generalization performance of the algorithms. Furthermore, the fact that the
algorithms were able to generalize so well, is also due to the robust feature vector 
representation scheme. The features selected are suitable for enabling the good over- 
all performance. Moreover, it can be assumed that if new characters were to be added, 
the performance of the algorithms, concerning the new characters, would still be 
high. 

As a particular case, we present two confusion matrices resulting from specific
scenarios, showing where maximum confusion between characters based on the
extracted features is to be expected. Table 8 shows the confusion matrix for the
scenario CL1-1 of Table 5. For the CL1 dataset, this scenario takes 10% of the
samples from the first two images as the training set and 90% of the samples of the
same images as the test set. The results correspond to the application of the SVM
algorithm to the training and test sets. In particular, notice the case of characters
“a” and “s”, which were mutually misclassified, and the case of character “8”, which
was misclassified 9 times as “o”.



Table 8. Confusion matrix for the scenario CL1-1 in Table 5.

Table 9. Confusion matrix for the scenario CL2-1 in Table 5.

Table 9 shows the confusion matrix for the scenario CL2-1 of Table 5. For the CL2
dataset, this scenario takes 10% of the samples from the first two images as the
training set and 90% of the samples of the same images as the test set. The only
errors produced here were 6 instances of the character n classified as co. The rest
of the characters and character ligatures in this scenario were correctly classified.



4 Conclusions and Further Work 

In this paper, we present a novel methodology that assists the recognition of early
Christian Greek manuscripts written in lower case letters. We do not provide a
complete character recognition system; instead, we strive toward an assessment of
the recognition procedure by tracing and recognizing the most frequently appearing
characters or character ligatures, using a segmentation-free, quick and efficient
approach. Based on the observation that hole regions appear in the majority of
characters and character ligatures, we propose a recognition technique that consists
of several distinct stages. Experimental results show that the proposed method gives
highly accurate results that offer great assistance in the interpretation of old Greek
handwritten manuscripts.

Future work involves the detection and recognition of all the remaining old Greek
handwritten characters and character ligatures that do not include holes, as well as
testing the performance of the proposed technique on other types of old handwritten
historical manuscripts.



Abstract. The paper presents a document analysis system to retrieve metadata
from digitized ancient manuscripts. This platform has been developed to assist
researchers, historians and libraries in processing a wide variety of manuscripts
written in different languages. In order to retrieve different metadata from
various digitized documents, we propose a user-training system, which uses robust
approaches based on a sequential bottom-up process. We develop a low-level
segmentation and a basic recognition stage which do not use prior knowledge
of the documents' contents. Our objective was to study the feasibility of processing
a large variety of manuscripts with the same platform, which can be used by
non-specialists in image analysis.



1 Introduction 

There are many different objectives for the digitization of documents, which depend
on the final usage of the digital copy [1]. Among digitization projects, some are
concerned with rare medieval manuscripts, which can be accessed only by a small
number of students and researchers. The digitization of rare collections will improve
their accessibility to a wider audience and will protect rare and fragile documents
from frequent handling. It also provides historians and researchers with a new way of
indexing, consulting and retrieving information. Medieval manuscripts contain a
wide variety of metadata, such as text contents, layouts, typography, writing styles,
document structures, author authentication, inks, paper texture, written annotations,
signs, drawings, ornaments and decorated frames, which help historians to date and
authenticate the manuscripts. These metadata are difficult to retrieve manually, and
fine indexing would not be possible without better automation of the retro-conversion
process using image analysis. Metadata for medieval manuscripts are very
precisely defined by specialists from the European project MASTER* (Manuscript
Access through Standards for Electronic Records).

We present a document analysis system for medieval manuscripts, developed jointly
with specialists from IRHT, the French Institute on Texts History. This platform has
been developed for researchers, historians and libraries who want to retrieve
metadata from these digitized medieval manuscripts. We have chosen the metadata
which require intensive manual work and which can be retrieved automatically by an
image analysis system. Our objective is to study the feasibility of developing
document image analysis systems suited to medieval manuscripts, which can be used
easily by non-experts in pattern recognition and image analysis. This tool has been
developed to support collaboration with archivists and historians and to make them
familiar with pattern recognition systems. It also provides the opportunity to
understand users' needs.

* http://xml.coverpages.org/masterGentintr.html

S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 75-89, 2004.
© Springer-Verlag Berlin Heidelberg 2004

In the first section, we present the metadata, defined by specialists, which can be
retrieved successfully by image analysis systems for both Latin and Arabic
manuscripts.

The second section describes the architecture of the document analysis system and
the approach used. Because of the complexity of medieval manuscripts and their
variability, we have chosen simple approaches which do not use prior knowledge
of the documents' contents. This justifies the choice of a user-training system based
on classical bottom-up segmentation.

The last section gives preliminary results, which simply show the feasibility of
automatically retrieving metadata from medieval manuscripts. We have noticed that
the results depend more on the quality of the image than on the complexity of the
contents.

1.1 Medieval Latin Manuscripts 

Manuscript indexing needs great expertise, which a computer vision system cannot
achieve properly. But image analysis can retrieve useful information which can help
document indexing. The French Research Institute on Texts History^ (IRHT) is a
CNRS laboratory which performs fundamental research on ancient manuscripts prior
to 1500. The IRHT collects and digitizes rare manuscripts with the support of the
French Ministry of Culture. It has also participated in the MASTER project, which
aims at the specification of metadata for Latin-language manuscripts. We have
defined different metadata that image analysis systems can retrieve automatically:

• Illuminated objects: objects decorated with gold.

• Main body page: this requires distinguishing the main text from outside objects
such as annotations, notes, drawings, marks for the manufacturer, etc.

• Physical layout: the simplest information needed for finding the book structure.
The metadata of interest for medieval manuscripts are the text baselines which
guide the copyist in writing, the number of text lines, the number of columns and
the text justification.

• Miniatures: pictures or scenes painted on the page.

• Initial: a large initial capital letter, usually painted in a contrasting color or
illuminated in gold.

• Writing styles: the different styles of writing that might exist in the same
manuscript.

• Keywords: important words which define book structures, like “incipit”. They can
be written in different ways using different alignments.

• Separating different text colors: text is handwritten using different colors,
alternating red, blue, green and black. The study of the color spectrum is also
important for research; however, it is carried out by physical and chemical analysis.
Nevertheless, a low-cost characterization of colors can be achieved by image
analysis if the camera is well calibrated and color references are stored.

• Page references: page numbers need to be located and read automatically
in order to detect missing pages, and to index each physical image with the
correct logical reference of a precise citation.

^ IRHT: http://www.irht.cnrs.fr




Fig. 1. Medieval manuscripts showing illuminated paintings, initials, decorations, decorated 
characters, frames, complex layout. 



We have discarded from our analysis all metadata which require prior knowledge
and specific processing, like the physical layout, the reading of page references and
word spotting. Conversely, we keep all the metadata which can be retrieved with
a generic approach, like the writing styles, colored objects, paintings, illuminated
objects, drop capitals and specific character decoration.



1.2 The Metadata for the Arabic Manuscripts 

The complex nature of the Arabic language is evident in the cursiveness of the text,
character overlapping, various character shapes, diacritics and the variety of
calligraphic Arabic writing styles. Metadata from the MASTER project have been
extended to Arabic manuscripts [2]. One hundred and seventy-three metadata
elements were created to cover all aspects of Arabic manuscripts. This work was the
result of three main studies:

• A study of the literature on the general characteristics of Arabic manuscripts
from the point of view of codicology and palaeography.

• A thorough analysis of a corpus of 21 digitised manuscripts (3200 images).

• A questionnaire distributed among experts in Arabic manuscripts in order to
find out their needs.

The extracted metadata try to take into consideration the different aspects of the
user's interest: the history of the manuscripts, the codicology, the palaeography, the
logical structure, etc. The metadata can be classified into three main categories:

a) Bibliographic metadata, which describe the origin of manuscripts (author, title,
date, place, etc.).

b) Metadata that describe the content and layout of a manuscript, its logical
structure, and particular characteristics of the manuscript such as the existence of
illustrations, figures, poem lines, punctuation, stamps, etc.

c) Administrative metadata, which record physical information on the manuscript
support (microfilm, digitised copy, etc.), the condition of digitization, the entry date
of the manuscript in the database, the name of the cataloguer, etc.

Our concern in this article is to define metadata from the second category which
can be retrieved by computer vision systems. They are very similar to the metadata
defined for Latin manuscripts, but Arabic manuscripts show some specificities:

• Main body page: unlike Latin manuscripts, most Arabic manuscripts were written
inside a framed page, especially the Koran text.

• The writings in the margin: these writings give some explanation of the text or
comments by the reader, or in some cases describe an event or an action, like what
is known in Arabic as (al-sama’at wa al-qira’at), which means the listening and the
readings. The writings in the margin also differ in their presentation: some are
written vertically, some horizontally, some in zig-zag, etc.

• Ornaments: these important decorations are located in different places of the
manuscript: in the margin, inside the text, either to replace the punctuation or
around the titles of the chapters in some manuscripts.

• The chapter title: chapter titles are written either in bold or in different colors. But
sometimes the chapter title is written using the same size and color as the text.

• Illustrations: they take different forms as well as different sizes.

• The page layout: unlike Latin manuscripts, Arabic manuscripts generally use a
single column. But we found two to three columns in Christian-Arabic manuscripts
of translated religious texts, where every column is written in a different language.
Writing in a table or in a circle shape is also very common in Arabic manuscripts.
Rectangle-shaped writings are found at the end of the majority of the manuscripts,
particularly in the colophon, where the copyist or the author ends his manuscript by
mentioning the place and the date of finishing his work.

The three million Arabic manuscripts scattered around the world have other
characteristics, too many to present in this article. It is worth mentioning that
manuscripts not yet discovered may have further characteristics that have not so far
been covered by codicological studies.

Fig. 2. Samples of Arabic manuscripts and the variety of metadata and image qualities.



2 The Proposed Document Image Analysis System 

We have developed a generic platform to retrieve information from old manuscripts
by supervised training. We found this approach more realistic than a fixed rule-based
system, which cannot describe the great variety of multilingual manuscripts. A
supervised training system allows users to define the metadata and to build their own
model for each manuscript. We assume that books have a homogeneous layout and
keep a regular presentation of the metadata, which can be described by a user-built
model. We chose a simple sequential architecture which does not require prior
knowledge or complex rules. We also rejected complex architectures and
non-deterministic approaches which use feedback loops, heuristics, multi-agent
approaches or systems based on interactions. Our objective is to study the feasibility
of automatically processing digitized manuscripts with a generic platform which can
be used by non-specialists in image processing and pattern recognition.

2.1 Image Segmentation 

The image quality of digitized manuscripts is uneven. To save money and time,
numerous projects digitize microfilm copies of the manuscripts rather than the
originals. Digitizing microfilm is easier, faster and cheaper than digitizing the
originals. On the other hand, it yields poor-quality images, generally with very few
gray levels, because the microfilm process lightens the background and enhances the
contrast. Microfilm is frequently digitized in bi-level because the information
represented by the other gray levels is not relevant. Consequently, information lost
during the digitization process cannot be restored.

In order to analyze color, grayscale and bi-level images, we propose to convert
color images step by step into the optimal gray-level images and then into bi-level
images. This allows images with different color depths to be analyzed by the same
processing line. Between steps, we keep all intermediate images for image feature
extraction. The user selects the best algorithm, which is immediately stored in the
user-built model associated with the document. We have also developed a specific
color segmentation algorithm for digitized documents, based on an unsupervised
classifier which uses local information computed in a sliding window to adapt to
color variations [3].
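
The conversion chain described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the luminance weights used for the gray-level step and Otsu's method used for the bi-level step are assumptions that stand in for whichever algorithms the user would select, and all function names are hypothetical.

```python
def to_gray(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray levels.

    Assumption: standard luminance weights, not the platform's "optimal" conversion.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def otsu_threshold(gray_image):
    """Return the gray level that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray_image:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of the levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def to_bilevel(gray_image, threshold):
    """Mark dark pixels (ink) as 1 and light pixels (background) as 0."""
    return [[1 if v <= threshold else 0 for v in row] for row in gray_image]


def convert(rgb_image):
    """Run the chain and keep every intermediate image for later feature extraction."""
    gray = to_gray(rgb_image)
    threshold = otsu_threshold(gray)
    return {"color": rgb_image, "gray": gray, "threshold": threshold,
            "bilevel": to_bilevel(gray, threshold)}
```

Keeping the color, gray-level and bi-level images together in one result mirrors the idea that features can later be computed at any color depth.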

As we do not have a precise model of the layout for all medieval manuscripts, we
chose a bottom-up (data-driven) segmentation, which does not require prior
knowledge. We first locate all connected components, and then classify each
component according to the features selected by the user. Recognized components
are gradually merged into higher-level interpreted elements, such as text, tables,
drawings, decorations and frames.
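
The first step of this bottom-up process, connected component labeling, can be sketched as follows. This is a generic breadth-first labeling given for illustration only; the 8-connectivity default and the function names are assumptions, not details taken from the platform.

```python
from collections import deque


def label_components(bilevel, connectivity=8):
    """Label foreground (1) pixels; return (label image, list of pixel lists)."""
    rows, cols = len(bilevel), len(bilevel[0])
    if connectivity == 8:
        steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if (dy, dx) != (0, 0)]
    else:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    labels = [[0] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if bilevel[r][c] != 1 or labels[r][c] != 0:
                continue
            label = len(components) + 1  # labels start at 1; 0 is background
            labels[r][c] = label
            pixels = []
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bilevel[ny][nx] == 1 and labels[ny][nx] == 0):
                        labels[ny][nx] = label
                        queue.append((ny, nx))
            components.append(pixels)
    return labels, components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)
```

Each component returned here is the unit on which features are later computed and which the classifier labels before merging.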




Fig. 3. First step of the bottom-up process: connected component labeling.



Automatic Metadata Retrieval from Ancient Manuscripts 81

We provide ready-to-use processing tools which remove noise in bi-level images,
disconnect touching objects and remove frames. All these tools are based on
mathematical morphology [4]. For example, to remove frames and the shadow of the
book borders in bi-level images, we apply several morphological operations. We
measure, by morphological convolution, the Feret diameter in the vertical direction
of each connected component. The resulting image describes each component with a
gray level corresponding to the height of the minimal convex hull which contains the
object (fig. 4a). According to a user-defined threshold, we build two images, which
separate small objects, which mostly contain text (fig. 4c), from large objects like
frames and pictures (fig. 4d). Then we compute, by morphology, the distance
transform to measure the thickness of the large objects. Objects that have a small
thickness and are the largest in height are deleted (fig. 4e). The final image without
frames (fig. 4f) is computed by summing the text image (fig. 4c) and the picture
image (fig. 4e). Similar approaches, used to disconnect touching components and to
filter noise on object contours, are described in [5].
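
The frame-removal rule above (delete objects that are tall but thin) can be sketched per component as follows. This is only an illustration under stated assumptions: the vertical Feret diameter is taken as the bounding-box height, the thickness is approximated by repeated 4-neighbour erosion (a city-block distance transform), and the two thresholds are hypothetical user parameters. The platform itself works with morphological operators on whole images.

```python
def feret_height(pixels):
    """Vertical Feret diameter, taken here as the height of the bounding box."""
    ys = [y for y, _ in pixels]
    return max(ys) - min(ys) + 1


def max_thickness(pixels):
    """Approximate the largest stroke thickness of a component.

    Pixels are peeled layer by layer (4-neighbour erosion). A pixel removed in
    round d lies at city-block distance d from the background, so the thickest
    part is roughly 2 * d_max - 1 pixels wide.
    """
    remaining = set(pixels)
    rounds = 0
    while remaining:
        border = {(y, x) for (y, x) in remaining
                  if any(n not in remaining
                         for n in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)))}
        remaining -= border
        rounds += 1
    return 2 * rounds - 1


def split_frames(components, height_threshold, thickness_threshold):
    """Return (kept, removed); tall and thin components are treated as frames."""
    kept, removed = [], []
    for pixels in components:
        tall = feret_height(pixels) > height_threshold
        thin = max_thickness(pixels) <= thickness_threshold
        (removed if tall and thin else kept).append(pixels)
    return kept, removed
```

With this rule a one-pixel-wide rectangular frame is removed, while a filled picture of similar height is kept because its thickness is large, and text symbols are kept because they are small.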




Fig. 4. Frame suppression by morphology: d) frames and images; e) frame suppression;
f) image without frames.



2.2 Main Body Segmentation 

We have designed other specialized tools to recognize the main body of a page
without layout extraction. Locating the main text body is very important because the
classification of an object may change according to its position relative to the main
document body. For example, a text zone is classified as an annotation if it is located
outside the main body and as regular text if it is inside. The main body may be
delimited by a frame, but book borders and some figures which might be mistaken
for frames could introduce confusion. We choose to detect the main body by
locating the text, even though the text is not always justified and does not always fill
the entire page. First, we estimate the average size of text symbols by averaging the
size of all components. Second, we compute a text probability value for each
component, based on the normalized difference between the size of the component
and the average text symbol size (Fig. 5b).

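$$P(x) = 1 - \frac{\left|\mathrm{Size}(x) - \mathrm{AverageSize}\right|}{\max_{y}\left|\mathrm{Size}(y) - \mathrm{AverageSize}\right|}, \qquad y \text{ ranging over all objects}$$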

A high value indicates that the component has a high probability of being a text
symbol. Then, we sum the text probability values horizontally and vertically to build
X-Y profiles for the entire image. An automatic threshold is computed for each
profile. The text bodies, for the two pages of one document, are found by taking the
maximum limits of the profile coordinates having values lower than the computed
threshold (Fig. 5c). This approach works most of the time, even on Latin
manuscripts, but the text bodies can be oversized when there are large annotation
areas in the margin.

Fig. 5. Main body segmentation using text zones location: b) class text probabilities;
c) X-Y profiles and threshold; d) other result.
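
The main-body location step can be sketched in Python as follows. Several details are assumptions made only for illustration: the size of a component is taken as its bounding-box area, the probability is one minus the normalized size deviation, the automatic threshold is a fixed fraction of each profile's maximum, and the body limits are the outermost coordinates whose profile value exceeds that threshold. None of these choices is confirmed by the text.

```python
def text_probabilities(sizes):
    """Return 1 - normalized deviation from the average component size."""
    average = sum(sizes) / len(sizes)
    deviations = [abs(s - average) for s in sizes]
    largest = max(deviations) or 1.0  # all sizes equal: every deviation is zero
    return [1.0 - d / largest for d in deviations]


def xy_profiles(boxes, probabilities, width, height):
    """Add each component's probability to every column and row it spans."""
    x_profile = [0.0] * width
    y_profile = [0.0] * height
    for (top, left, bottom, right), p in zip(boxes, probabilities):
        for x in range(left, right + 1):
            x_profile[x] += p
        for y in range(top, bottom + 1):
            y_profile[y] += p
    return x_profile, y_profile


def extent_above(profile, threshold):
    """First and last coordinates whose profile value exceeds the threshold."""
    hits = [i for i, v in enumerate(profile) if v > threshold]
    return (hits[0], hits[-1]) if hits else None


def main_body(boxes, width, height, ratio=0.5):
    """Return ((x_min, x_max), (y_min, y_max)) of the estimated main text body.

    boxes holds (top, left, bottom, right) bounding boxes of the components;
    ratio is a hypothetical stand-in for the automatic threshold.
    """
    sizes = [(b - t + 1) * (r - l + 1) for (t, l, b, r) in boxes]
    probabilities = text_probabilities(sizes)
    x_profile, y_profile = xy_profiles(boxes, probabilities, width, height)
    x_extent = extent_above(x_profile, ratio * max(x_profile))
    y_extent = extent_above(y_profile, ratio * max(y_profile))
    return x_extent, y_extent
```

In this sketch a page-sized frame receives a probability of zero and an isolated margin note contributes too little to the profiles to extend the body, which matches the behaviour described above; a dense annotation area would raise the profiles and enlarge the body.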

2.3 Feature Analysis 

After document layout analysis in terms of columns, lines and words, we compute for
each component a feature vector using measures which are selected by the user and
which depend on the metadata to be retrieved and the complexity of the image. We
have chosen three families of features:

• Color features (RGB, YUV, HSL color systems)

• Shape features (object density, compactness, convexity, anisotropy, orientation,
contour, moments, curvature distribution of the contour, X-Y profiles) (Fig. 6)

• Geometry (object size, maximum diameter in 8 directions, thickness, length...)

Fig. 6. a) Curvature distribution along the contour; b) X-Y projections.



The recognition of some metadata cannot rely only on the shape, size and color of
objects; we also have to take into account the spatial relations between objects. For
example, the alignment and regularity of word positions are important to distinguish
text from tables. Some chapter titles are recognizable only by their indentation when
they use the same style and size as the regular text. Alignments and distances
between objects are therefore important features for many metadata. We measure two
horizontal alignments ah with the left and right neighbours, two vertical alignments
av with the top and bottom neighbours, and four distances d to the four nearest
neighbours (Fig. 7). Unlike for Latin manuscripts, we measure the vertical
alignments av for Arabic manuscripts on the right side, in order to take the reverse
direction of reading into account.

To take neighbouring objects into account during recognition, we augment the
feature vector of each component with the features of the neighbouring objects in the
four main directions. The dimension of the resulting feature space is too high to build
a classifier with a limited set of training vectors. We therefore reduce the dimension
of the feature space with an automatic feature combination algorithm using Principal
Component Analysis (PCA). This stage also spares the user from arbitrarily
combining or selecting the features which maximize cluster separation.
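
A minimal sketch of these two steps, neighbour augmentation followed by PCA, is given below. It is an illustration under stated assumptions: missing neighbours are zero-filled, and the principal axes are found by power iteration with deflation, a simple stand-in for whatever eigen-decomposition the platform uses. All function names are hypothetical.

```python
import math


def augment(features, neighbours):
    """Append to each vector the features of its four neighbours.

    neighbours[i] lists the indices of the left, right, top and bottom
    neighbours of object i, with None where no neighbour exists
    (zero-filled here, which is an assumption).
    """
    dim = len(features[0])
    out = []
    for i, vec in enumerate(features):
        row = list(vec)
        for j in neighbours[i]:
            row.extend(features[j] if j is not None else [0.0] * dim)
        out.append(row)
    return out


def covariance(data):
    """Return (means, sample covariance matrix) of a list of vectors."""
    n, dim = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(dim)]
    centered = [[row[k] - means[k] for k in range(dim)] for row in data]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    return means, cov


def principal_axes(cov, count, iterations=200):
    """Leading eigenvectors of cov by power iteration with deflation."""
    dim = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    axes = []
    for _ in range(count):
        v = [1.0 / math.sqrt(dim)] * dim
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(dim))
                         for a in range(dim))
        axes.append(v)
        for a in range(dim):  # remove the component just found
            for b in range(dim):
                cov[a][b] -= eigenvalue * v[a] * v[b]
    return axes


def project(data, means, axes):
    """Project centered vectors onto the retained principal axes."""
    return [[sum((row[k] - means[k]) * axis[k] for k in range(len(means)))
             for axis in axes] for row in data]
```

Augmenting a vector of dimension n with four neighbours gives dimension 5n, which is why a reduction step is needed before a classifier is trained on few samples.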




Fig. 7. a) Spatial relations between objects; b) Spatial features d, ah, av.



2.4 Training and Recognition

Training is an important step, which is made easy by a user-friendly interface.
The user defines the number of classes and gives a class name to each metadata. The
interface (Fig. 15) allows the user to select components and assign their metadata
classes. All corrections and new entries are saved in the model, which displays the
number of training vectors for each class of metadata. Training is carried out on
several representative pages from the same manuscript. The model associated with
each manuscript stores the training vectors, the number of metadata classes and
their names, the color conversion process, the thresholding method, the
post-processing operations (frame removal, disconnection of touching objects, noise
removal) and the features selected by the user. We chose a simple K-NN classifier for
its training speed and its simplicity. A condensing algorithm is used to reduce the
number of vectors and improve recognition speed without decreasing performance.
We aim to use a more appropriate classifier such as the SVM, which is suited to a
small number of training vectors, but this approach requires a longer training
step. Recognition is performed on all connected components segmented within the
image. In the resulting file, each object is treated differently according to its location
inside or outside the main page body.
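
The classifier and the condensing step can be sketched as follows. The text does not name the condensing algorithm; the sketch assumes a Hart-style condensed nearest neighbour rule, and the Euclidean distance and function names are likewise illustrative assumptions.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train, query, k=1):
    """Return the majority label among the k training vectors nearest to query.

    train is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: squared_distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def condense(train):
    """Keep only the vectors needed to classify the training set with 1-NN.

    Starting from one stored vector, every training vector that the current
    store misclassifies is added, until a full pass adds nothing.
    """
    store = [train[0]]
    changed = True
    while changed:
        changed = False
        for item in train:
            if item in store:
                continue
            if knn_classify(store, item[0], k=1) != item[1]:
                store.append(item)
                changed = True
    return store
```

On well-separated classes the condensed store can be much smaller than the training set while still classifying every training vector correctly, which is the speed gain the paragraph refers to.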



3 Experiments and Results 

The platform has been tested on 6 Arabic manuscripts and 2 Latin manuscripts,
which make up 1361 single pages. The number of metadata classes differs from one
manuscript to another. The recognition results are difficult to evaluate precisely
because of the lack of ground truth in this field. A preliminary coarse evaluation of
the results is enough to determine the feasibility of the proposed document analysis
system. We notice that the results depend more on the quality of the image than on
the origin of the manuscript and its language. We therefore present the results for
both Latin and Arabic manuscripts classified by image quality. For color images
scanned directly from the originals at high resolution, the results are very
satisfactory: we obtained successful recognition of a large variety of metadata such
as titles, ornaments, drawings, text delimiters, colored objects, punctuation and
regular text. The following figures show results for both Arabic and Latin color
manuscripts.

For the given sample of an Arabic color manuscript (fig. 8a), we have defined 4
logical classes: regular text (fig. 8d), notes (fig. 8c), illuminated ornaments and
frames (fig. 8f) and punctuation (illuminated circles, fig. 8e). The distinction between
principal text and notes is based on the position of objects with respect to the
main document body.

The second example shows different metadata from medieval French manuscripts.
We have selected very few features: the RGB color system as color features, and
the thickness and average curvature of components as shape features.

Fig. 8. Results of metadata extraction on manuscripts from Palestine MS6191 (BM Lyon):
e) illuminated punctuation; f) illuminated frames and drawings.



These simple features are sufficient to recognize with very high accuracy the
original text from the main copyist, the additional writings by the second copyist and
the colored drop capitals (fig. 9). From these documents, we also easily recognize
the paintings, the illuminated ornaments, the special words written with thick lines,
and the text baselines which help the copyist to align the text. The last example
shows that even for complex and rich documents, the bottom-up approach makes it
possible to segment several hundred objects on the same page and classify them
correctly by using simple features like the color (illuminated and red characters or
miniatures) and the thickness and average curvature of objects (difference of writing
style). However, the circular library stamp located in the center of the page is not
correctly classified, because of the thickness of its components.

Fig. 9. Results of metadata extraction on manuscripts from Auxerre AA01, source IRHT.

Fig. 10. Results of metadata extraction on manuscripts from Amiens MS0354, source IRHT.

Fig. 11. Results of metadata extraction on manuscripts R12051 from BNF.

Fig. 12. Errors from manuscripts R12051 due to over-exposure or stains.



For grayscale or binary images scanned from microfilms, the results depend on the
quality of the support (stains, holes, color of the ink...), whether JPEG compression
is used, the resolution, and the regularity of the image contrast and brightness.

The main difficulty with microfilm digitization is the irregular exposure and the
contrast enhancement, which reduce the number of gray levels after brightness
normalization. For the manuscripts R12051 (fig. 11), digitised from microfilm, the
image quality is not sufficient to retrieve illustrations correctly, because of stains, or
to differentiate black and red text, because of the brightness variation within the
same manuscript (fig. 12).

Fig. 13. Results of metadata extraction on manuscripts R28062 from BNF.



For the most complex manuscript we had to analyze (R28062 from BNF), the
quality of the binary image is sufficient to retrieve simple metadata like illustrations,
tables, text and titles. But our platform cannot process combined metadata. In the
previous example, it is impossible to distinguish titles inside tables from titles outside
tables, because they have the same shape and almost the same spatial relations. To
achieve this separation, we would first have to recognize the object ‘table’ and then
the object ‘title in the table’. A cascade architecture is required to retrieve combined
metadata.

Figure 14 shows that complex metadata like circular tables cannot be recognized
properly. After the merging step, which progressively groups connected components
with the same label, the system finds that the circular table is classified both as
illustration and as table. Moreover, some random alignments of the diacritics of the
titles and of the numerals included in the text are systematically classified as tables.
This shows the limit of our simple architecture, which cannot retrieve complex
metadata without specialized algorithms and specific heuristics.



4 Conclusion and Perspective 

We have developed a DIA system based on a sequential bottom-up process to retrieve
simple metadata from ancient manuscripts. The chosen approach limits the
performance of such tools for complex metadata and poor-quality images.
Nevertheless, the results are promising for images with sufficient resolution and
color depth. This approach seems generic for simple metadata, but many metadata,
such as interleaved metadata, are still not retrieved by our platform. The next step is
therefore to develop a cascade architecture to recognize interleaved metadata. We
have also noticed two major problems in the usage of our package: users find it
difficult to select complex, hard-to-understand parameters during model creation,
and the training step is time-consuming. The productivity of such tools must be
evaluated by users over the long term, and the ergonomics must be improved for
novice users.

Fig. 14. Results of metadata extraction on manuscripts R28062 from BNF.

Fig. 15. Interface during the learning step for the training of the classes “text” and “title”.



Abstract. This paper presents a complete system that historians/archivists can
use to digitize whole collections of documents relating to personal information. 
The system integrates tools and processes that facilitate scanning, image index- 
ing, document (physical and logical) structure definition, document image 
analysis, recognition, proofreading/correction and semantic tagging. The system 
is described in the context of different types of typewritten documents relating 
to prisoners in World-War II concentration camps and is the result of a multina- 
tional collaboration under the MEMORIAL project funded (€1.5M) by the 
European Union (www.memorial-project.info). Results on a representative se- 
lection of documents show a significant improvement not only in terms of OCR 
accuracy but also in terms of overall time/cost involved in converting these 
documents for digital archives. 



1 Introduction 

The problem of converting collections of documents into digital archives or libraries 
raises a significant number of, often disparate, issues that are mostly specific to the 
given type of document or application. For instance, old/historical manuscripts suffer 
from ageing and often the main concern in their conversion into digital form (digital 
restoration) is the improvement in legibility by human readers (e.g., historians) who 
will most probably also transcribe the handwritten text [3]. Projects involving histori- 
cal documents are, more often than not, small-scale vertical operations that require 
significant human input. On the other end of the spectrum, relatively modern printed 
documents do not suffer from significant substrate/ink degradation problems and lend 
themselves to a higher degree of automated processing, OCR (albeit not trivial) and 
more sophisticated content extraction and indexing [4]. Finally, there are specific 
applications such as the conversion of administrative documents, which are typically 
forms with fixed structure [1]. 

This paper presents a comprehensive approach to the conversion of large collec- 
tions of documents that exhibit most of the issues outlined above. A complete frame- 
work is described that comprises software tools and quality evaluation driven work- 
flow procedures that are intended to be used by historians/archivists (i.e., non- 
technical but subject-conversant users) to convert a broad range of documents starting 
from scanning and achieving a semantic representation as the end goal. 
Contrary perhaps to typical conversion applications where the documents to be
converted are all present in one physical location, the approach described here is in- 
tended to address situations where historical documents are dispersed across different 
institutions of varying financial and technical means. Therefore, the approach has to 
be flexible, cost-effective and de-centralised, while ensuring high quality of results 
across a number of users and document classes. 

The approach is being developed by an international consortium as the primary
goal of the “MEMORIAL” project (www.memorial-project.info) funded by the European
Union - Fifth Framework Programme: Information Society Technologies priority
(€1.5M). The full title of the project is: “A Digital Document Workbench for
Preservation of Personal Records in Virtual Memorials”, which also hints at the nature of
the document dataset selected for study: documents that contain information about
people.

The documents used to demonstrate the proposed conversion approach are all from 
the middle of the 20th century but in their majority suffer (sometimes severely) from
age, handling and production-related degradation. The majority of the information on 
the documents is typewritten (handwritten annotations, signatures and stamps are also 
present) in a variety of formats and underlying logical (semantic) structures. 

The particular documents constitute unique sources (no copies exist elsewhere) of 
personal information that amount to the only record and proof of existence for many 
thousands of people that passed through Nazi concentration camps. Large collections 
of documents containing this type of information are closely guarded in various
individual museums and archives, making the whole body of knowledge contained
in them virtually inaccessible. Naturally, the significance of the whole approach,
and of the document image analysis methods involved in particular, extends
far beyond the chosen document class to practically any typewritten document
(which includes enormous numbers of documents of the 20th century).

There is scarcely any report in the literature of conversion of this type of typewrit- 
ten documents into a logically indexed, searchable form. A notable exception is a 
project to convert file cards from the archives of the Natural History Museum in Lon- 
don, UK [2]. This project involved the digitisation of (mostly typewritten) index cards 
with a bank-cheque scanner and the subsequent curator-assisted extraction and recog- 
nition of taxonomic terms and annotations. The results are indexed hierarchically in a 
database. 

The work described in this paper is of a different nature and necessitates a relatively
different approach. First, many of the documents (as in most archives) are fragile,
and curators heavily resist mass scanning. Second, the paper is frequently damaged
by use and decay and, sometimes, heavily stained. Third, the characters typed on
the paper may not be the result of direct impression but of impression through the
original paper and a carbon sheet as well (characters in carbon copies are frequently
blurred and joined together). Finally, there may not be as ordered a logical structure in
the text and position of documents as in a taxonomy card index, for instance (although
there usually is some logical information that historians/archivists are able to
specify).

The implication of the above issues is that there is a requirement for more involved
and, at the same time, more generic document analysis. The volume of text and the
relatively unrestricted dictionary possibilities evident in many of the documents do
not permit the use of experimental (purpose-built) OCR. An off-the-shelf OCR package
is used, for which the characters are segmented and individually enhanced in
advance by the methods developed for the project.

Two examples of document classes exhibiting typical characteristics of a variety of 
layouts and conditions are taken from the Stutthof camp (http://www.stutthof.pl): 

■ Transport lists - 3 kinds of documents, where names of Stutthof prisoners are 
present - 1) lists of prisoners that arrived at Stutthof, 2) lists of prisoners that were 
moved from Stutthof to other camps, and 3) lists of prisoners freed from the camp. 
A sample transport list can be seen in Fig. 1 (a). Similar transport lists (as compiled 
by the SS) exist in a number of museums and archives throughout Europe. 

■ Catalogue cards - these are cards created by historians shortly after the end of the 
war and contain information about people from various sources (as typewritten text 
and stamps). There are about 200,000 of these cards in Stutthof alone and, in terms 
of the project, provide a different class of documents but with typewritten text 
fields to test the applicability of the framework developed. An example catalogue 
card can be seen in Fig. 1 (b). 




Fig. 1. (a) Example of a transport list, (b) Example of a catalogue card. 



The remainder of the paper presents each of the steps involved in the complete 
conversion approach. More precisely, Section 2 outlines the document input phase.
Section 3 describes the document structure definition processes. Section 4 details each 
of the following document image analysis steps: segmentation of background entities, 
layout analysis, character location and image based character enhancement. The char- 
acter recognition and post-processing stage is summarised in Section 5, while a brief 
description of the final web-enabled system is given in Section 6. An assessment of 
the effectiveness of the whole approach concludes the paper in Section 7. 
2 Document Input

The documents exist in a variety of physical conditions (in terms of damage and de- 
cay). The transport lists, especially, most often are the duplicate pages (carbon copies) 
produced when the original lists were typed. As such, the surviving documents are 
printed on rice (a.k.a. Japanese) paper, which is very thin, rendering the use of an 
automatic document feeder impossible. 

There are other important issues raised by the type of the paper (see Fig. 1). First, 
there is a background texture, which is very prominent as multi-colour noise in the 
colour scans. Scanning can potentially improve the quality of the resulting image 
(studies were carried out with a variety of scanners and set-ups), although historians 
prefer to see a facsimile of the original (with the background texture) when they study 
the documents. The decision was made to retain the fidelity of the scanned documents 
to the originals and to place the burden on the image analysis stage. 

Second, typewritten characters may not be sharply defined but blurred and often 
faint, depending on the amount of force used in striking the typewriter keys. This 
situation is exacerbated in the case of carbon copies (when the force was not enough 
to carry through to the rice paper or the quality of the carbon was not so good). The 
decision was made to scan the pages without any attempt to improve them during 
scanning and, therefore, defer processing to the image analysis stage. 

One requirement imposed for consistency of further processing, is that documents 
are scanned against a dark colour background (covering the scanning area surround- 
ing the paper document). 

Documents are scanned at 300 dpi, in 24-bit colour TIFF format (with lossless
compression). An indexing tool created as part of the project handles the scanned 
documents, collects simple metadata information from the historian/archivist perform- 
ing the scan, and enters the data into a working repository. The working repository is 
effectively an internal database, which is accessed from all the subsequent processes 
to retrieve and store information about a document. 

3 Document Structure Definition 

For many classes of archive documents, it is possible to define a correspondence be- 
tween the physical and the logical structure of a document. This is especially the case 
for documents having a fixed form-type physical structure. In the case of the transport 
lists, there is definitely a logical structure (in an oversimplified way: certain blocks of 
information followed by lists of personal information, followed by certain closing 
blocks of information) but there is not a fixed layout correspondence. It is the respon- 
sibility of the historian/archivist user of the final system to group together very similar 
documents and, using a tool developed by the consortium, create a template where 
physical (generic) entities on a page are associated with logical information. 

The template creation process takes into account an XML specification of a base 
(generic) document layout model [5]. An interactive template editor (Fig. 2) as well as 
two helper applications are provided in the system to facilitate the creation of template
XML files for the documents to be converted. The template editor allows the
users to work directly on the document image canvas and define the layout components
using easy drag and drop procedures.
Fig. 2. A screenshot of the interactive template editor.



The subsequent extraction of the textual content of the document is guided by the 
document template. The template is an initially empty container XML structure. This 
structure specifies generic regions (as rectangles) of a predefined type of content (e.g. 
table block, salutation block etc.). Region specifications are interpreted by the docu- 
ment analysis methods to extract precise text regions (e.g. textlines, table cells etc.) 
from the scanned document pages. This (geometrical parameter) information is then 
inserted into the respective regions of the XML template to produce content XML 
files (“filled” document structure) - one for each respective page of input. 
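The relationship between the empty template and the filled content XML can be illustrated with a minimal sketch. The element and attribute names used here (template, region, textline, x, y, width, height) are purely hypothetical, since the actual layout model is specified in [5] and is not reproduced in this paper; the coordinates are made-up values.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: one coarse rectangle per generic region.
TEMPLATE = """<template>
  <region type="table" x="40" y="300" width="900" height="600"/>
</template>"""


def fill_template(template_xml, located):
    """Insert precisely located text regions into the coarse template regions.

    `located` maps a region type to a list of (x, y, width, height)
    rectangles found by the image analysis stage.
    """
    root = ET.fromstring(template_xml)
    for region in root.iter("region"):
        for x, y, w, h in located.get(region.get("type"), []):
            ET.SubElement(region, "textline",
                          x=str(x), y=str(y), width=str(w), height=str(h))
    return root


content = fill_template(TEMPLATE, {"table": [(45, 310, 880, 28),
                                             (45, 342, 880, 28)]})
print(ET.tostring(content, encoding="unicode"))
```

In such a scheme the recognised text would later be attached to each located element, so that a single file carries both geometry and content.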

The OCR process subsequently looks up each of the content XML regions in correspondence
with the enhanced image in order to resolve problems with recognising
individual words and groups of characters (using corresponding dictionaries: people’s
first names, geographical place names etc.). The recognised text (possibly after
additional human editing) is then inserted at the appropriate positions in the content
XML, completing the document conversion process.



4 Document Image Analysis 

The main goal of the image analysis step is to prepare the ground for optimal OCR 
performance (compensating for off-the-shelf OCR inadequacy to deal with the docu- 
ment class in hand). This goal is in reality twofold. First, the quality of the image data 
has to be improved to the largest possible extent afforded by the application. Starting 
with a colour-scanned document with a large number of artefacts (noisy background, 
paper discolouration, creases, and blurred, merged and faint text, to name but a few) 
the result must be a bi-level improved image where characters are enhanced (seg- 
mented, restored and faint ones retrieved from the background) as much as possible. 

Second, individual semantic entities must be precisely located and described in the 
content XML structure. The required level of abstraction for the semantic entities is 
defined in advance. The document template XML structure coarsely outlines the loca- 
tion of regions in the image (e.g., there is a table region contained within a given 
notional rectangle). The document image analysis methods must locate the required 
instances of logical entities (e.g., individual table cells) and enter this information in 
the content XML. Using the resulting content XML as a map, the OCR process can be
directed to that specific location in the image and fill in the recognised text in the
content XML structure.

The main steps of the document image analysis stage are: segmentation of background
entities, character location and character improvement. Each of these steps is
described in the remainder of this section.



4.1 Segmentation of Background Entities 

The first step in the image analysis chain is to locate the paper document inside the 
document image. Due to the requirement to scan each document against a dark green 
background, a dark outer region surrounding the document exists in every image 
(Fig. 1). To identify the dark outer edge, the Lightness component of the HLS repre- 
sentation of each pixel is examined. 

The process of outer edge segmentation starts by examining the edge pixels of the 
image in each of the four edges (top, left, bottom, right). Starting with each edge 
pixel, we move inwards, and the difference in Lightness of each pair of adjacent pix- 
els is checked. If the difference is found to be above an experimentally defined 
threshold, the pixel is marked as a potential paper-edge one. A pixel is also marked as 
a potential paper-edge one, if the difference in Lightness between the current pixel 
and the average of the previous pixels examined (in this row or column) is above the 
same threshold as before. This ensures that gradient transitions from the dark outer- 
edge to paper will also be identified. 
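The two tests applied while moving inwards can be sketched for a single row (or column) of pixels as follows. This is a minimal illustration and not the project code: the function names and the threshold value of 0.25 are assumptions, since the threshold is only described as experimentally defined.

```python
import colorsys


def lightness(rgb):
    """Lightness (0..1) of an (r, g, b) pixel with 0..255 channels."""
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hls(r, g, b)[1]


def find_paper_edge(pixels, threshold=0.25):
    """Scan pixels from the image border inwards.

    Returns the index of the first potential paper-edge pixel,
    or None if no transition is found. The threshold is hypothetical.
    """
    values = [lightness(p) for p in pixels]
    running_sum = values[0]
    for i in range(1, len(values)):
        step = abs(values[i] - values[i - 1])      # adjacent-pixel test
        drift = abs(values[i] - running_sum / i)   # test against running average
        if step > threshold or drift > threshold:
            return i
        running_sum += values[i]
    return None


# Sharp transition from dark green background to paper.
sharp = [(0, 60, 0)] * 5 + [(230, 220, 200)] * 5
# Gradual transition: no single step exceeds the threshold.
ramp = [(g, g, g) for g in (20, 20, 20, 20, 50, 80, 110, 140, 170)]
print(find_paper_edge(sharp), find_paper_edge(ramp))
```

The second example shows why the running average is needed: each adjacent difference stays below the threshold, yet the accumulated change is still detected.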

The paper edges are expected to be approximately straight; therefore, the pixels
identified as potential paper-edge ones at the previous step are further inspected and
any spikes (pixels having a large displacement compared with their neighbouring
ones) are eliminated.

A straight line is subsequently fitted on each of the four potential paper edges, and 
a second check is performed, based on the straight lines identified, in order to elimi- 
nate any wider spikes that cross the fitted line and were missed in the first round. 
Finally, the outer edge pixels are labelled as such. 

Based on the fitted lines, two problems can be addressed. First, assuming that the 
edges of the original paper are mostly straight and pair-wise perpendicular, a first 
attempt is made to calculate and correct the skew of the document. This correction 
step appears to be sufficient in the majority of cases. Second, the origin (top-left cor- 
ner) of the page can be identified, which is important in terms of matching the tem- 
plate provided to the image at hand. 
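Spike elimination, line fitting and skew estimation for one paper edge can be sketched as below. This is an illustrative simplification under stated assumptions: the neighbourhood size, the displacement tolerance and the use of an ordinary least-squares fit are not taken from the project, and the function names are hypothetical.

```python
import math
from statistics import median


def remove_spikes(positions, window=2, max_jump=3.0):
    """Keep (index, position) pairs that agree with their neighbours.

    positions[i] is the edge position found in row (or column) i.
    """
    kept = []
    for i, p in enumerate(positions):
        lo, hi = max(0, i - window), min(len(positions), i + window + 1)
        neighbours = positions[lo:i] + positions[i + 1:hi]
        if abs(p - median(neighbours)) <= max_jump:
            kept.append((i, p))
    return kept


def fit_line(points):
    """Least-squares fit of position = slope * index + intercept."""
    n = len(points)
    mean_i = sum(i for i, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    sxx = sum((i - mean_i) ** 2 for i, _ in points)
    sxy = sum((i - mean_i) * (p - mean_p) for i, p in points)
    slope = sxy / sxx
    return slope, mean_p - slope * mean_i


def skew_degrees(slope):
    """Skew angle implied by the slope of a fitted paper edge."""
    return math.degrees(math.atan(slope))


edge = [10 + 0.1 * i for i in range(20)]
edge[7] = 40.0                      # a spike caused by noise
slope, intercept = fit_line(remove_spikes(edge))
print(round(skew_degrees(slope), 2))
```

A second pass, as described above, would discard the points that lie far from the fitted line and repeat the fit.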

A different type of background entity that is of interest and needs to be segmented 
is that of areas of reconstructed paper. The presence of this type of areas is an artefact 
resulting from earlier document restoration attempts, where missing paper (due to 
tears, holes etc.) is “grown” back using liquid paper (e.g., see especially the bottom- 
right and top-left corners of the page image in Fig. 1 (a)). Certain regions of printed 
information share similar colour characteristics as reconstructed paper areas; there- 
fore, it is important that such areas are segmented before the character segmentation 
process takes place, to avoid any subsequent misclassification. The segmentation of 
reconstructed-paper areas is performed in two steps. First, potential areas of recon- 
structed paper are identified in the image, based on their colour characteristics. Sub- 
sequently, the identified regions are filtered based on their location in the image. 
In order to segment the reconstructed-paper areas in an image, the Lightness and
Saturation components of the HLS colour system are examined. During this analysis, 
all pixels having Saturation and Lightness values in certain ranges, experimentally 
derived, are labelled as potential parts of a reconstructed paper area. 

Subsequently, connected component analysis is performed, and the pixels labelled as
reconstructed paper are organized into components. The resulting components
are then examined, and the ones touching the outer edge (identified before) are
kept as true reconstructed-paper areas, whereas the rest are discarded. This is
because reconstructed-paper areas only appear at the edge of the page. As a
post-processing step, a closing operation takes place (combined dilation and erosion
operations), in order to close any internal gaps in the connected components identified.
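The colour test and the location-based filtering can be sketched as follows. The Lightness and Saturation ranges are hypothetical placeholders (the real ranges were derived experimentally and are not given here), 4-connectivity is an assumption, and the final closing operation is omitted for brevity.

```python
import colorsys
from collections import deque


def is_candidate(rgb, l_range=(0.80, 1.00), s_range=(0.00, 0.25)):
    """True if the pixel falls in the (hypothetical) L and S ranges."""
    r, g, b = (c / 255.0 for c in rgb)
    _, l, s = colorsys.rgb_to_hls(r, g, b)
    return l_range[0] <= l <= l_range[1] and s_range[0] <= s <= s_range[1]


def keep_edge_components(candidate, outer):
    """Keep only candidate components adjacent to the outer (background) area.

    candidate, outer: 2D lists of bool with the same shape, assumed disjoint.
    """
    h, w = len(candidate), len(candidate[0])
    seen = [[False] * w for _ in range(h)]
    result = [[False] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if not candidate[y0][x0] or seen[y0][x0]:
                continue
            component, touches = [], False
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w:
                        if outer[ny][nx]:
                            touches = True
                        elif candidate[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if touches:
                for y, x in component:
                    result[y][x] = True
    return result
```

Components in the interior of the page, which are more likely to be printed information with similar colour, are discarded by the adjacency test.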

The result of this pre-processing step can be seen in Fig. 3, where the identified 
surrounding area is shown in green and the reconstructed paper areas in orange. 



4.2 Character Location 

In order to improve the text regions to the effect that merged characters are separated, 
and faint ones are “lifted” from the background, the approach described here performs 
an individual character location and enhancement process. This approach is novel in 
this type of application and is afforded by the regularity of the typewriter font. 

In order to locate individual characters in the image, a top-down approach is fol- 
lowed. First, the regions of interest are looked up in the XML document template. 
This minimizes the overall processing effort required, since character location only 
takes place within given areas instead of the whole image (although the methods can 
be extended to work with the whole image). For each text region of the template, a 
two-step process takes place to locate the characters: first, the identification of textli- 
nes in the region is performed, and then for each textline extracted, the characters 
within it are segmented. 
Information retrieved from the document template is employed to facilitate character
location. The main properties of the typewriter font exploited here are the constant 
font size (character height and maximum width) and the fact that there are no vertical 
overlaps between parts of adjacent characters. This a-priori knowledge of the charac- 
ter set used (provided in the document template) allows for an initial prediction about 
the spacing between textlines and between adjacent characters to be made. 

The identification of textlines in a text region is based on the analysis of the verti- 
cal projection of the region (based on the Lightness component). Each textline is ex- 
pected to contribute at least one local maximum (high count of dark pixels) in the 
vertical projection. Moreover, this local maximum is expected to appear close to the 
centre of each text line, thus to have a distance larger than a pre-defined threshold 
from its left and right minima. Based on this heuristic, an initial filtering of maxima in 
the vertical projection takes place, so that only those that potentially signify textlines 
are considered. 

For each maximum identified we examine the distance between itself and the 
maxima that follow. As long as two subsequent maxima present a small distance 
between them (defined based on character set information) and are of similar strength, 
they are considered to belong to the same text line. The group of maxima identified 
defines the broad area where a textline lies. Based on that, the centre of the textline is
defined, and provisional top and bottom textline separator positions are hypothesised
based on known information about the textline’s height. The local minima on
the left and right of the area identified are initially labelled as the top and bottom 
separators of the textline. The top (and bottom) separator is then moved towards the 
provisional top (or bottom) position, until a steep change occurs in the projection 
histogram, or the provisional position is reached. 

Finally, the separators produced for the textline are examined against previously 
identified textlines, and certain amendments take place to ensure that textlines appear 
in a continuous fashion and any white space between them is properly discarded. 
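The first part of the textline identification (profile, filtering of maxima and grouping of nearby maxima) can be sketched as below. This is a reduced illustration: the profile holds one value per pixel row, which is what the text above calls the vertical projection; the strength threshold and the maximum gap are hypothetical parameters; and the similar-strength test and the refinement of the separators are left out.

```python
def row_profile(binary):
    """Count dark pixels (1s) in each row of a 2D list of 0/1 values."""
    return [sum(row) for row in binary]


def local_maxima(profile, min_strength):
    """Indices of local maxima that are at least min_strength high."""
    peaks = []
    for i, v in enumerate(profile):
        left = profile[i - 1] if i > 0 else -1
        right = profile[i + 1] if i + 1 < len(profile) else -1
        if v >= min_strength and v > left and v >= right:
            peaks.append(i)
    return peaks


def group_into_textlines(peaks, max_gap):
    """Merge maxima closer than max_gap; return (first, last) per textline."""
    groups = []
    for p in peaks:
        if groups and p - groups[-1][-1] <= max_gap:
            groups[-1].append(p)
        else:
            groups.append([p])
    return [(g[0], g[-1]) for g in groups]


profile = [0, 1, 5, 9, 6, 8, 2, 0, 0, 1, 7, 10, 7, 1, 0]
peaks = local_maxima(profile, min_strength=4)
print(group_into_textlines(peaks, max_gap=3))
```

In practice max_gap would be derived from the character height stored in the template, so that two maxima produced by the same line of text are merged while maxima of adjacent lines are not.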

For each of the textlines identified before, a similar strategy is followed to separate 
individual characters. The horizontal projection of each textline is used in this case. 
First, any white space found on the left of the characters needs to be discarded. In 
order to do so, the maxima of the projection histogram are located, and the first 
maximum of some importance (strength higher than a threshold) is identified. 

Based on the position of the first important maximum (that corresponds to a char- 
acter) the first character separator is placed on the local minimum on its left. Using 
information about the fixed width of the characters (character set information from 
the template), we can project the position of the next character separator based on the 
first one. Every minimum that lies plus or minus a fixed width from the projected 
separator is considered as a potential next character separator, and is scored according 
to its strength, and its distance from the projected character separator. The minimum 
with the highest score is labelled as the next character separator and the process is 
repeated. 
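The iterative placement of character separators can be sketched as follows. The scoring used here is an assumption made for illustration: every column in the search window is considered (not only strict local minima), and the cost of a column is its profile value plus a weighted distance from the projected position, so the lowest cost corresponds to the highest score in the description above.

```python
def locate_separators(profile, first_sep, char_width, tolerance,
                      distance_weight=1.0):
    """Place character separators along a textline.

    profile: dark-pixel count per pixel column of the textline.
    first_sep: column of the first separator (left of the first character).
    char_width: fixed character pitch taken from the template.
    tolerance: half-width of the search window around each projection.
    """
    separators = [first_sep]
    while separators[-1] + char_width < len(profile):
        projected = separators[-1] + char_width
        lo = max(separators[-1] + 1, projected - tolerance)
        hi = min(len(profile) - 1, projected + tolerance)
        best = min(range(lo, hi + 1),
                   key=lambda x: profile[x] + distance_weight * abs(x - projected))
        separators.append(best)
    return separators


# Gaps between characters at columns 0, 10, 21, 30 and 40.
profile = [5] * 45
for gap in (0, 10, 21, 30, 40):
    profile[gap] = 0
print(locate_separators(profile, first_sep=0, char_width=10, tolerance=2))
```

Because each new projection starts from the separator actually chosen, a small local deviation (column 21 in the example) does not accumulate along the line.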

By locating individual characters within textlines, an important problem which of- 
ten hinders the OCR stage is readily addressed: merged characters (characters that are 
touching in the original image) can now be separated. An example of individual char- 
acters precisely located within the document image can be seen in Fig. 4. It can be 
seen that, apart from the local thresholding of the characters, the separators correctly 
split characters that were merged in the original image (e.g. “GER” in “KONZEN- 
TRATIONSLAGER”). 
Fig. 4. A field in a catalogue card and the corresponding result after character location and
enhancement. 



4.3 Image Based Character Enhancement 

Having identified the position of all characters in the image, local processing can take
place for each character. This processing aims at improving the characters and
producing a black and white image of each character, which will be used by OCR in the
next stage. A local (individual character) approach produces much better results in the
case of typewritten documents, since individual characters are usually struck with
different strengths.

A number of contrast enhancement and adaptive thresholding approaches can be 
performed at this point. Experimentation still takes place but as an initial approach, 
the method proposed by Niblack [6] has been adopted. This initial decision was made 
after consideration of a number of alternatives (including variants of histogram 
equalisation techniques [8] and Weszka and Rosenfeld’s [9] approach). A key
characteristic of Niblack’s method seems to be the accurate preservation of character edges;
however, it does not perform very well on areas of degraded background that do not
contain any pixels of a character. A sample result can be seen in Fig. 5.
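Niblack's rule sets the threshold to T = m + k·s, where m and s are the mean and standard deviation of the grey values in a local window. The sketch below applies the rule once per character cell, which is an assumption made for illustration (the original method uses a window around each pixel), and k = -0.2 is a commonly used value rather than one reported for this project.

```python
from statistics import mean, pstdev


def niblack_threshold(values, k=-0.2):
    """Niblack threshold T = m + k * s for a list of grey values."""
    return mean(values) + k * pstdev(values)


def binarise_cell(cell, k=-0.2):
    """Binarise one character cell (2D list of 0..255 grey values).

    Returns 1 for character (dark) pixels and 0 for background.
    """
    flat = [v for row in cell for v in row]
    t = niblack_threshold(flat, k)
    return [[1 if v < t else 0 for v in row] for row in cell]


strong = [[200, 200, 200], [200, 50, 200], [200, 200, 200]]
faint = [[220, 220, 220], [220, 180, 220], [220, 220, 220]]
print(binarise_cell(strong))
print(binarise_cell(faint))
```

Since the threshold adapts to each cell, the faint stroke is recovered as well as the strongly pressed one. The same adaptivity explains the weakness noted above: in a cell that contains only noisy background, the darker noise pixels fall below the local threshold and are marked as character pixels.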




Fig. 5. Detail of text showing faint and strongly pressed characters properly recovered. 

It should be noted that very encouraging results have been obtained, with merged 
characters correctly separated and faint characters (previously classified as back- 
ground) recovered. The ability to locate individual characters constitutes a very sig- 
nificant benefit for any enhancement process and this is one of the characteristic ad- 
vantages of this project. 



5 Character Recognition and Post-processing 

An off-the-shelf OCR package is given the enhanced image and the location of each 
logical entity (from the intermediate content XML structure). At the end of this step, 
the recognised characters are inserted in the content XML structure. 

The OCR package cannot be trained directly on the document class in hand. How- 
ever, the results of the OCR are post-processed taking into account the type of the 
logical entity to which they correspond. For instance, if the logical entity is a date, 
only digits and separators (e.g., hyphens) are considered and the result can be further 
validated. Similarly for surnames, names and placenames, although the frequent practice
of using German spelling (and re-naming) of places and people’s names makes the
process more complicated.
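Entity-driven post-processing of a date field can be sketched as below. The accepted separators and the day.month.year format are assumptions chosen for the example; the formats actually handled by the system are not listed in this paper.

```python
import re
from datetime import datetime


def clean_date(ocr_text, fmt="%d.%m.%Y"):
    """Keep only digits and separators, then validate the result as a date.

    Returns a datetime.date, or None if the cleaned text is not a valid date.
    """
    kept = re.sub(r"[^0-9.\-/]", "", ocr_text)     # drop everything else
    normalised = re.sub(r"[.\-/]", ".", kept)      # unify the separators
    try:
        return datetime.strptime(normalised, fmt).date()
    except ValueError:
        return None


print(clean_date(" 12-03-1943,"))   # stray space and comma are removed
print(clean_date("31.02.1943"))     # impossible date is rejected
```

Fields that fail validation would be natural candidates for the human proofreading step.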

Experiments are still being carried out to establish the full extent of measurable 
improvement in recognition rate as opposed to applying OCR to the original image 
(before enhancement). Initial figures (from a representative sample including images
of different levels of quality) indicate that the whole approach presented here
improves on direct OCR. More specifically, 95.5% of the characters were recognised
correctly, as opposed to 71.4% with direct OCR. Furthermore, 84.2% of the words
were correctly recognized, in contrast to 52.3% with direct OCR. It should be noted, 
however, that the measure of overall effectiveness of the complete approach (Sec- 
tion 7) is a more meaningful measure and the reader is referred to that. 
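The recognition figures quoted above can also be read as error rates, which makes the size of the improvement easier to judge. The short calculation below uses only the four percentages reported in this section; the function name is hypothetical.

```python
def error_reduction(baseline_accuracy, improved_accuracy):
    """Relative reduction in error rate, with accuracies given in percent."""
    baseline_error = 100.0 - baseline_accuracy
    improved_error = 100.0 - improved_accuracy
    return (baseline_error - improved_error) / baseline_error


char_gain = error_reduction(71.4, 95.5)   # character errors: 28.6% -> 4.5%
word_gain = error_reduction(52.3, 84.2)   # word errors: 47.7% -> 15.8%
print(f"{char_gain:.1%} fewer character errors, {word_gain:.1%} fewer word errors")
```

In other words, roughly five out of six character errors and two out of three word errors made by direct OCR are avoided on this sample.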



6 Web Database and User Access 

The project will create a prototype portal where historians, government officials and 
the public can initiate a query through the web (a pilot can be seen at website 
www.memorial-project.info). The final approved content XML document structures 
will form the basis for extracting selected information to include in a web database 
application. Each type of user will be authenticated first and then receive the result of 
their query applied to each of the participating archives (appropriately censored ac- 
cording to their user status). 



7 Effectiveness of the Approach 

The MEMORIAL project is still under way, therefore any results obtained are pre- 
liminary at this point in time. Further development is taking place on the character 
enhancement stage and more document classes are being introduced to test each of the 
components of the approach. However, the results presented here are largely represen- 
tative of the performance of the system, albeit they are given on a smaller number of 
document classes. The documents that contribute to the following discussion are the 
(original) file cards and the “transport list forgeries”. The latter are documents faith- 
fully reproduced by historians on original paper and typewriters dating from the time 
of the transport lists and having the same layout structure as the originals. The reason
for creating these realistic “forgeries” is to initially test the system with realistic
images that can be shown and published (as opposed to the original documents
containing personal information) at an early stage of the project.

The effectiveness of the whole approach is assessed by evaluating acceptance- 
testing scenarios. Two are discussed here, for reasons of brevity (following each of 
the uniform and semantic models). Each model takes into account four metrics: the 
average OCR confidence level (as output by the package), the percentage of correctly 
recognised characters, the percentage of correctly recognised words and, finally, the 
document preparation time ratio (indicating time/cost savings as opposed to human 
transcription). Quality (effectiveness of the system) is expressed in the range of 0-1. 
Three different cases are compared in terms of quality value: the direct application of 
the off-the-shelf package to the document, the application of the OCR package fol- 
lowing thresholding by Otsu’s method [7], and finally, the comprehensive approach 
of the MEMORIAL project. 
The uniform model (shown in Fig. 6) applies equal weights to each of the metrics.
On the other hand, the semantic model (shown in Fig. 7) is biased towards the time- 
ratio, indicating the effectiveness of the system in terms of the time and cost benefit 
as opposed to human transcription. It is evident that the whole approach constitutes an 
overall improvement to both the manual transcription and to the semi-automated ap- 
plication of off-the-shelf packages. Moreover, the richness of information (semanti- 
cally tagged) obtained by the MEMORIAL approach is far superior to the output of 
generic OCR. 
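The difference between the two models can be illustrated by a weighted average of the four metrics, each expressed in the range 0-1. This is only a sketch: the paper does not state how the metrics are combined or which weights the semantic model uses, so the weighted-average form, the weights and the metric values below are all made-up illustrations and not results of the project.

```python
def quality(metrics, weights):
    """Weighted average of metrics (all in 0..1); the result is in 0..1."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total


# Illustrative values only, not measurements.
metrics = {"confidence": 0.8, "characters": 0.9, "words": 0.7, "time_ratio": 0.6}

uniform = {"confidence": 1, "characters": 1, "words": 1, "time_ratio": 1}
# Hypothetical bias towards the time ratio.
semantic = {"confidence": 1, "characters": 1, "words": 1, "time_ratio": 3}

print(quality(metrics, uniform), quality(metrics, semantic))
```

Under such a scheme, a method that recognises characters well but saves little transcription time scores lower in the semantic model than in the uniform one.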



Fig. 6. Uniform quality assessment model graph (series: Direct OCR, Otsu + OCR, MEMORIAL).



Fig. 7. Semantic quality assessment model graph (series: Direct OCR, Otsu + OCR, MEMORIAL).



8 Concluding Remarks 

This paper has presented a series of processes and tools, with an emphasis on docu- 
ment analysis, that constitute a comprehensive framework to convert historical type- 
written (but not necessarily limited to that) documents. Historical documents are 
unique in many ways and dealing with them requires special consideration both in 
terms of methods and in the overall thinking required. The richness of information 
contained in such documents (which needs to be reflected in the final digital
representation) and the artefacts due to degradation/heavy use (rendering most well
established methods useless) strongly indicate a necessity for a paradigm shift from
methods designed to be used by technically-oriented people to systems to be used by (and
to incorporate the significant expertise of) historians/archivists. This is the most
important lesson learned.

In terms of technical observations, the most important one is that the “improved” 
(the result of all the stages prior to OCR) image that is the most visually appealing to 
users is not necessarily the one that gives the best OCR results. Experimentation is 
necessary before making behind-the-scenes decisions since an off-the-shelf OCR 
package is used here. Moreover, the variety of paper types and preservation (or deg- 
radation, rather) states encountered necessitates a large degree of flexibility even for 
documents produced with the same typewriter at the same point in time. Work con- 
tinues on these aspects to improve the overall effectiveness of the approach and ex- 
tend it for use with more types of documents. 
Abstract. Historical document collections are a valuable resource for human
history. This paper proposes a novel digital image binarization scheme for low 
quality historical documents allowing further content exploitation in an efficient 
way. The proposed scheme consists of five distinct steps: a pre-processing pro- 
cedure using a low-pass Wiener filter, a rough estimation of foreground regions 
using Niblack’s approach, a background surface calculation by interpolating 
neighboring background intensities, a thresholding by combining the calculated 
background surface with the original image and finally a post-processing step in 
order to improve the quality of text regions and preserve stroke connectivity. 
The proposed methodology works with great success even in cases of historical 
manuscripts with poor quality, shadows, nonuniform illumination, low contrast, 
large signal-dependent noise, smear and strain. Tests on numerous low quality
historical manuscripts show that our methodology performs better than current
state-of-the-art adaptive thresholding techniques.



1 Introduction 

It is common that documents belonging to historical collections are poorly preserved 
and prone to degradation processes. This work aims at leveraging state-of-the-art 
techniques in digital image binarization and text identification for digitized documents 
allowing further content exploitation in an efficient way. The method has been devel- 
oped in the framework of the Greek GSRT-funded R&D project, D-SCRIBE, which 
aims to develop an integrated system for digitization and processing of old Greek 
manuscripts. 

Binarization (threshold selection) is the starting step of most document image 
analysis systems and refers to the conversion of the gray-scale image to a binary im- 
age. Binarization is a key step in document image processing modules since a good 
binarization sets the base for successful segmentation and recognition of characters. In 
old document processing, binarization usually distinguishes text areas from back- 
ground areas, so it is used as a text locating technique. In the literature, the binariza- 
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tion is usually reported to be performed either globally or locally. The global methods 
(global thresholding) use a single threshold value to classify image pixels into object 
or background classes [1-5], whereas the local schemes (adaptive thresholding) can 
use multiple values selected according to the local area information [6,7]. Most of the 
proposed algorithms for optimum image binarization rely on statistical methods, with- 
out taking into account the special nature of document images [8-10]. Global thresh- 
olding methods are not sufficient for document image binarization since document 
images usually have poor quality, shadows, nonuniform illumination, low contrast, 
large signal-dependent noise, smear and strains. Instead, techniques that adapt to
local information have been developed for document binarization [11-14].

In this paper, a novel adaptive thresholding scheme is introduced in order to bi- 
narize low quality historical documents and locate meaningful textual information. 
The proposed scheme consists of five basic steps. The first step is dedicated to a de- 
noising procedure using a low-pass Wiener filter. We use an adaptive Wiener method 
based on statistics estimated from a local neighborhood of each pixel. In the second 
step, we use Niblack’s approach for a first rough estimation of foreground regions. 
Usually, the foreground pixels are a subset of Niblack’s result since Niblack’s method 
usually introduces extra noise. In the third step, we compute the background surface of
the image by interpolating neighboring background intensities into the foreground
areas that result from Niblack’s method. A similar approach has been proposed for
binarizing camera images [15]. In the fourth step, we proceed to final thresholding by
combining the calculated background surface with the original image. Text areas are 
located if the distance of the original image from the calculated background exceeds a 
threshold. This threshold adapts to the gray-scale value of the background surface in 
order to preserve textual information even in very dark background areas. In the final 
step, a post-processing technique is used in order to eliminate noise pixels, improve 
the quality of text regions and preserve stroke connectivity. The proposed method was 
tested with a variety of low quality historical manuscripts and it turned out that it is 
superior to current state-of-the-art adaptive thresholding techniques. 



2 Previous Work 

Among the best-known approaches for adaptive thresholding are Niblack’s method [8]
and Sauvola’s method [11].

Niblack’s algorithm [8] calculates a pixelwise threshold by shifting a rectangular
window across the image. The threshold T for the center pixel of the window is
computed using the mean m and the standard deviation s of the gray values in the window:

$$T = m + k \cdot s \qquad (1)$$

where k is a constant set to -0.2. The value of k is used to determine how much of the
total print object boundary is taken as a part of the given object. This method can dis- 
tinguish the object from the background effectively in the areas close to the objects. 
The results are not very sensitive to the window size as long as the window covers at 
least 1-2 characters. However, noise that is present in the background remains domi- 
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nant in the final binary image. Consequently, if the objects are sparse in an image, a 
lot of background noise will be left. Sauvola’s method [11] solves this problem by 
adding a hypothesis on the gray values of text and background pixels (text pixels have 
gray values near 0 and background pixels have gray values near 255), which results in 
the following formula for the threshold: 

$$T = m \cdot \left(1 - k \cdot \left(1 - \frac{s}{R}\right)\right) \qquad (2)$$

where R is the dynamic range of the standard deviation fixed to 128 and k is fixed to 
0.5. This method gives better results for document images. 
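
As a rough illustration of Eqs. (1) and (2), the following Python sketch computes both thresholds for a single window of gray values. The function names, the example patches, and the use of the population standard deviation for s are assumptions made for this illustration; they are not taken from the cited methods.

```python
from statistics import mean, pstdev

def niblack_threshold(window, k=-0.2):
    """Eq. (1): T = m + k * s over the gray values of one window."""
    return mean(window) + k * pstdev(window)

def sauvola_threshold(window, k=0.5, R=128.0):
    """Eq. (2): T = m * (1 - k * (1 - s / R))."""
    m, s = mean(window), pstdev(window)
    return m * (1.0 - k * (1.0 - s / R))

# A flat, slightly noisy background patch (no text at all).
background = [200, 202, 198, 201, 199, 200, 203, 197, 200]
# A patch containing both dark text pixels and light background pixels.
mixed = [30, 40, 35, 200, 210, 205, 198, 202, 45]

for name, patch in (("background", background), ("mixed", mixed)):
    t_n = niblack_threshold(patch)
    t_s = sauvola_threshold(patch)
    # Pixels darker than the threshold are labeled as foreground (text).
    print(name,
          "Niblack:", sum(p < t_n for p in patch),
          "Sauvola:", sum(p < t_s for p in patch))
```

On the flat background patch, Niblack's threshold sits just below the local mean, so ordinary noise is labeled as foreground, whereas Sauvola's threshold drops to roughly half the mean and leaves the patch empty. On the mixed patch both thresholds select the same dark pixels.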

3 Methodology 

The proposed methodology for low quality historical document binarization and text 
preservation is illustrated in Fig. 1 and fully described in this section. 
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Fig. 1. Block diagram of the proposed methodology for low quality historical document text 
preservation. 



3.1 Stage1: Pre-processing

Since historical document collections are usually of very low quality, a pre-processing 
stage of the grayscale source image is essential in order to eliminate noise areas, to 
smooth the background texture and to highlight the contrast between background and 
text areas. The use of a low-pass Wiener filter [16] has proved efficient for the above 
goals. Wiener filter is commonly used in filtering theory for image restoration. Our 
pre-processing module implements an adaptive Wiener method based on statistics 
estimated from a local neighborhood around each pixel. The grayscale source image 
is transformed to grayscale image I according to the following formula:

$$I(x,y) = \mu + \frac{(\sigma^2 - \nu^2)\,(I_s(x,y) - \mu)}{\sigma^2} \qquad (3)$$

where I_s is the grayscale source image, μ is the local mean and σ² the variance in an
NxM neighborhood around each pixel, and ν² is the average of all estimated local
variances. We used a 3x3 Wiener filter for documents with 1-2 pixel wide characters;
otherwise we used a 5x5 kernel. Fig. 2 shows the results of applying a 3x3 Wiener filter
to a document image.
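
The adaptive Wiener step can be sketched in plain Python as below, using the usual form I = μ + (σ² − ν²)(I_s − μ)/σ². This is a minimal sketch under several assumptions that the text does not specify: the helper names `local_stats` and `adaptive_wiener` are hypothetical, windows are clipped at the image border, the noise power ν² is taken as the mean of all local variances, and the gain is clamped so that it never becomes negative.

```python
def local_stats(img, x, y, r):
    """Mean and variance of the (2r+1)x(2r+1) neighborhood, clipped at borders."""
    h, w = len(img), len(img[0])
    vals = [img[j][i]
            for j in range(max(0, y - r), min(h, y + r + 1))
            for i in range(max(0, x - r), min(w, x + r + 1))]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)
    return mu, var

def adaptive_wiener(img, size=3):
    """Return the filtered image as a list of rows of floats."""
    r = size // 2
    h, w = len(img), len(img[0])
    stats = [[local_stats(img, x, y, r) for x in range(w)] for y in range(h)]
    # Noise power: average of all local variances (an assumption of this sketch).
    noise = sum(var for row in stats for _, var in row) / (h * w)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            mu, var = stats[y][x]
            denom = max(var, noise)
            # Clamped gain in [0, 1]; flat regions are replaced by their mean.
            gain = max(var - noise, 0.0) / denom if denom > 0 else 0.0
            out[y][x] = mu + gain * (img[y][x] - mu)
    return out

# A flat background with one isolated dark noise pixel in the middle.
noisy = [[200] * 5 for _ in range(5)]
noisy[2][2] = 180
filtered = adaptive_wiener(noisy, size=3)
print(round(filtered[2][2], 1))  # pulled from 180 toward the background level
```

The filter smooths low-variance background areas strongly and leaves high-variance areas (such as stroke edges) closer to the source values, which is the behaviour the pre-processing stage relies on.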
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Fig. 2. Image pre-processing: (a) Original image; (b) 3x3 Wiener filter. 



3.2 Stage2: Rough Estimation of Foreground Regions 

At this step of our approach, we want to obtain a rough estimation of foreground re- 
gions. Our intention is to proceed to an initial segmentation to foreground and back- 
ground regions and find a set of foreground pixels that is a superset of the correct set 
of foreground pixels. In other words, we intend to obtain a set of pixels that contains 
the foreground pixels plus some noise. Niblack’s approach for adaptive thresholding 
[8] is suitable for this case since Niblack’s method usually detects text regions but 
introduces extra noise (see Fig. 3). At this step, we process image I(x,y) in order to
extract the binary image N(x,y) that has 1’s for the rough estimated foreground
regions.




Fig. 3. Adaptive thresholding using Niblack’s approach: (a) Original image; (b) Estimation of 
foreground regions. 



3.3 Stage3: Background Surface Estimation 

At this stage, we compute an approximate background surface B(x,y) of the image. 
The pixels of the pre-processed source image I(x,y) belong to the background surface 
B(x,y) only if the corresponding pixels of the resulting rough estimated foreground 
image N(x,y) have zero values. The remaining values of surface B(x,y) are interpolated
from neighboring pixels. The formula for B(x,y) calculation is as follows:

$$B(x,y) = \begin{cases} I(x,y) & \text{if } N(x,y) = 0 \\[6pt] \dfrac{\sum_{i=x-dx}^{x+dx}\sum_{j=y-dy}^{y+dy} I(i,j)\,(1 - N(i,j))}{\sum_{i=x-dx}^{x+dx}\sum_{j=y-dy}^{y+dy} (1 - N(i,j))} & \text{if } N(x,y) = 1 \end{cases} \qquad (4)$$

The interpolation window of size dx x dy is defined to cover at least two image charac- 
ters. An example of the background surface estimation is demonstrated in Fig. 4. 
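
The background surface computation can be sketched as follows: background pixels are copied from I, and each foreground pixel receives the mean of the background pixels inside its interpolation window. The function name `background_surface`, the clipping of the window at the image border, and the fallback used when a window contains no background pixel are assumptions of this sketch.

```python
def background_surface(I, N, dx, dy):
    """Background surface B from image I and rough foreground mask N (1 = text)."""
    h, w = len(I), len(I[0])
    B = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if N[y][x] == 0:
                # Background pixel: the surface coincides with the image.
                B[y][x] = float(I[y][x])
                continue
            total, count = 0.0, 0
            for j in range(max(0, y - dy), min(h, y + dy + 1)):
                for i in range(max(0, x - dx), min(w, x + dx + 1)):
                    if N[j][i] == 0:
                        total += I[j][i]
                        count += 1
            # Fallback when the window holds no background pixel: keep I
            # (an assumption; the text does not cover this case).
            B[y][x] = total / count if count else float(I[y][x])
    return B

I = [[200, 200, 200],
     [200,  50, 200],
     [200, 200, 200]]
N = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]
B = background_surface(I, N, dx=1, dy=1)
print(B[1][1])  # the dark text pixel is replaced by the surrounding background
```

Choosing dx and dy large enough to span at least two characters, as the text requires, makes it very unlikely that a window contains no background pixel at all.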



Fig. 4. Background surface estimation: (a) Original image I; (b) Background surface B.

3.4 Stage4: Final Thresholding 

In this step, we proceed to final thresholding by combining the calculated background 
surface B with the original image I. Text areas are located if the distance of the origi- 
nal image from the calculated background is over a threshold d. We suggest that the 
threshold d must change according to the gray-scale value of the background surface B 
in order to preserve textual information even in very dark background areas. For this 
reason, we propose a threshold d that has smaller values for darker regions. The final 
binary image T is given by the following formula:

$$T(x,y) = \begin{cases} 1 & \text{if } B(x,y) - I(x,y) > d(B(x,y)) \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

A typical histogram of a document image (see Fig. 5) has two peaks, one for text
regions and one for background regions. The average distance δ between the
foreground and background can be calculated by the following formula:

$$\delta = \frac{\sum_x \sum_y \left(B(x,y) - I(x,y)\right)}{\sum_x \sum_y N(x,y)} \qquad (6)$$

Fig. 5. Document image histogram: (a) Original image; (b) Gray level histogram.

For the usual case of document images with uniform illumination, the minimum
threshold d between text pixels and background pixels can be defined with success as
q·δ, where q is a weighting parameter near 0.8 that helps preserve the total character
body in order to have successful OCR results [15]. In the case of very old documents
with low quality and nonuniform illumination, we want a smaller value of the threshold
d in darker regions, since text in dark regions often has a small
foreground-background distance. To achieve this, we first compute the average
background value b of the background surface B that corresponds to the text areas of
image N:

$$b = \frac{\sum_x \sum_y B(x,y)\,(1 - N(x,y))}{\sum_x \sum_y (1 - N(x,y))} \qquad (7)$$

We wish the threshold to be approximately equal to the value q·δ when the
background is large (roughly greater than the average background value b) and
approximately equal to p₂·q·δ when the background is small (roughly less than p₁·b),
with p₁, p₂ ∈ [0,1]. To simulate this desired behaviour, we use the following logistic
sigmoid function that exhibits the desired saturation behaviour for large and small
values of the background as shown in Fig. 6:



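$$d(B(x,y)) = q\,\delta \left( \frac{1 - p_2}{1 + \exp\left( \dfrac{-4\,B(x,y)}{b\,(1 - p_1)} + \dfrac{2\,(1 + p_1)}{1 - p_1} \right)} + p_2 \right) \qquad (8)$$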
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Fig. 6. Function d(B(x,y)). 



After experimental work, for the case of old manuscripts, we suggest the following
parameter values: q = 0.6, p₁ = 0.5, p₂ = 0.8.
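
The final thresholding stage can be sketched as below. The sketch assumes a logistic threshold of the form d(B) = q·δ·((1 − p₂)/(1 + exp(−4B/(b(1 − p₁)) + 2(1 + p₁)/(1 − p₁))) + p₂), which saturates near q·δ for bright backgrounds and near p₂·q·δ for dark ones; the function names are hypothetical, and the code assumes N contains at least one foreground and one background pixel.

```python
import math

def threshold_params(I, B, N):
    """Average foreground-background distance delta and average background b."""
    h, w = len(I), len(I[0])
    fg = sum(N[y][x] for y in range(h) for x in range(w))
    bg = h * w - fg
    # Both divisions assume fg > 0 and bg > 0 (not checked in this sketch).
    delta = sum(B[y][x] - I[y][x] for y in range(h) for x in range(w)) / fg
    b = sum(B[y][x] * (1 - N[y][x]) for y in range(h) for x in range(w)) / bg
    return delta, b

def d(Bxy, delta, b, q=0.6, p1=0.5, p2=0.8):
    """Background-dependent threshold (logistic sigmoid in the background value)."""
    z = -4.0 * Bxy / (b * (1.0 - p1)) + 2.0 * (1.0 + p1) / (1.0 - p1)
    return q * delta * ((1.0 - p2) / (1.0 + math.exp(z)) + p2)

def final_threshold(I, B, N, **params):
    """Binary image T: 1 where the image is darker than B by more than d(B)."""
    delta, b = threshold_params(I, B, N)
    return [[1 if B[y][x] - I[y][x] > d(B[y][x], delta, b, **params) else 0
             for x in range(len(I[0]))]
            for y in range(len(I))]

I = [[200, 200, 200], [200, 50, 200], [200, 200, 200]]
N = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
B = [[200.0] * 3 for _ in range(3)]
T = final_threshold(I, B, N)
print(T)
```

With the suggested values q = 0.6, p₁ = 0.5 and p₂ = 0.8, the threshold for a very dark background is about 80% of the threshold used for a bright one, so faint strokes on dark stains are more likely to survive.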

3.5 Stage5: Post-processing

In the final step, we proceed to post-processing of the resulting b/w image in order to 
eliminate noise, improve the quality of text regions and preserve stroke connectivity 
by isolated pixel removal and filling of possible breaks, gaps or holes. Below follows 
a detailed step-by-step description of the post-processing algorithm, which consists of
the successive application of shrink and swell filtering [17].

Step 1: A shrink filter is used to remove noise from the background. The entire b/w
image is scanned and each foreground pixel is examined. If P_sh is the number of
background pixels in a sliding nxn window, which has the foreground pixel as the central
pixel, then this pixel is changed to background if P_sh > k_sh, where k_sh can be defined
experimentally.

Step 2: A swell filter is used to fill possible breaks, gaps or holes in the foreground.
The entire b/w image is scanned and each background pixel is examined. If P_sw is the
number of foreground pixels in a sliding nxn window, which has the background pixel
(x,y) as the central pixel, and x_a, y_a are the average coordinates of all foreground
pixels in the nxn window, then this pixel is changed to foreground if P_sw > k_sw and
|x - x_a| < dx and |y - y_a| < dy. The latter two conditions are used in order to prevent
an increase in the thickness of character strokes since we examine only background
pixels among uniformly distributed foreground pixels.

Step 3: An extension of the above conditions leads to a further application of a swell
filter that is used to improve the quality of the character strokes. The entire b/w image
is scanned and each background pixel is examined. If P_sw1 is the number of foreground
pixels in a sliding nxn window, which has the background pixel as the central pixel,
then this pixel is changed to foreground if P_sw1 > k_sw1.

After experimental work, for the case of old manuscripts with an average letter
height of l_h, we suggest the following parameter values for the post-processing phase:
n = 0.15·l_h, k_sh = 0.9·n², k_sw = 0.05·n², dx = dy = 0.25·n, k_sw1 = 0.35·n². An example
of a resulting b/w image after the post-processing steps is given in Fig. 7.
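
The three post-processing steps can be sketched with two small filters, as below. The names `window`, `shrink` and `swell`, the clipping of windows at the image border, and the choice to read from the input image while writing to a copy are assumptions of this sketch; the thresholds are passed in as plain numbers rather than derived from the letter height.

```python
def window(img, x, y, n):
    """Coordinates of the n x n window centered on (x, y), clipped to the image."""
    r = n // 2
    h, w = len(img), len(img[0])
    return [(i, j)
            for j in range(max(0, y - r), min(h, y + r + 1))
            for i in range(max(0, x - r), min(w, x + r + 1))]

def shrink(img, n, k_sh):
    """Step 1: a foreground pixel becomes background if it has more than
    k_sh background pixels in its window."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v == 1:
                p_sh = sum(1 for i, j in window(img, x, y, n) if img[j][i] == 0)
                if p_sh > k_sh:
                    out[y][x] = 0
    return out

def swell(img, n, k_sw, dx=None, dy=None):
    """Steps 2 and 3: a background pixel becomes foreground if it has more than
    k_sw foreground pixels in its window. When dx and dy are given (step 2),
    the mean foreground position must also lie within dx, dy of the pixel."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v != 0:
                continue
            fg = [(i, j) for i, j in window(img, x, y, n) if img[j][i] == 1]
            if not fg or len(fg) <= k_sw:
                continue
            if dx is not None and dy is not None:
                x_a = sum(i for i, _ in fg) / len(fg)
                y_a = sum(j for _, j in fg) / len(fg)
                if abs(x - x_a) >= dx or abs(y - y_a) >= dy:
                    continue
            out[y][x] = 1
    return out

# An isolated noise pixel is removed by the shrink filter.
iso = [[0] * 5 for _ in range(5)]
iso[2][2] = 1
print(shrink(iso, n=3, k_sh=7))

# A one-pixel break in a horizontal stroke is filled without thickening it.
gap = [[0] * 5 for _ in range(5)]
gap[2] = [1, 1, 0, 1, 1]
print(swell(gap, n=3, k_sw=1, dx=1, dy=1))
```

In the second example the pixels directly above and below the stroke also see two foreground neighbours, but the mean foreground position lies a full pixel away vertically, so the position test keeps the stroke from growing thicker.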









Fig. 7. Post-processing stage: (a) B/W image resulting after final thresholding stage; (b) Image
after post-processing step 1; (c) Image after post-processing step 2; (d) Final image after post- 
processing. 



4 Experimental Results 

In this section, we compare the performance of our algorithm with those of Otsu [2], 
Niblack [8] and Sauvola et al. [11]. The testing set includes images from both hand- 
written and typed historical manuscripts. All images are of poor quality and have 
shadows, nonuniform illumination, smear and strain. Fig. 8 demonstrates an example 
of a typed manuscript, while Figs. 9 and 10 demonstrate examples of handwritten
manuscripts. The final post-processing task is demonstrated in Fig. 10. As shown in all
cases, our algorithm outperforms the other algorithms in preservation of
meaningful textual information.
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Fig. 8. Experimental results - binarization of a typed manuscript (a) Original image; (b) Otsu’s 
method; (c) Niblack’s method; (d) Sauvola’s method; (e) The proposed method (without post- 
processing). 



In order to extract some quantitative results for the efficiency of the proposed
binarization method, we compared the results obtained by the well-known OCR engine
FineReader 6 [18] with and without applying the proposed binarization technique, as
well as with applying the binarization techniques of Niblack [8] and of Sauvola et al.
[11]. To measure the quality of the OCR results we calculated the Levenshtein
distance [19] between the correct text (ground truth) and the resulting text. As shown in
Table 1, in all cases, the recognition results were improved after applying the proposed
binarization technique. The other two binarization techniques give worse results
in most cases.
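
For reference, the Levenshtein distance between an OCR output and its ground truth can be computed with the standard dynamic programming recurrence. The sketch below is a generic implementation with a hypothetical function name and made-up example strings; it is not the evaluation code used for Table 1.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

ground_truth = "historical manuscript"
ocr_output = "hist0rical manuscrpt"
print(levenshtein(ground_truth, ocr_output))
```

A lower distance means the recognized text is closer to the ground truth, so smaller values in Table 1 indicate better binarization for OCR purposes.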



Table 1. Representative OCR results after applying several binarization schemes. 





                        Levenshtein distance from the ground truth
                        Document 1   Document 2   Document 3   Document 4
Original image               68          221          185          408
Niblack’s method            228          619          513          447
Sauvola’s method             60          394          276          694
The proposed method          56          207          177          153
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Fig. 9. Experimental results - binarization of a handwritten manuscript (a) Original image; (b) 
Otsu’s method; (c) Niblack’s method; (d) Sauvola’s method; (e) The proposed method (without 
post-processing). 



5 Conclusions 

In this paper, we present a novel methodology that leads to preservation of meaningful 
textual information in low quality historical documents. The proposed scheme consists 
of five distinct steps: a pre-processing procedure using a low-pass Wiener filter, a 
rough estimation of foreground regions using Niblack’s approach, a background sur- 
face calculation by interpolating neighboring background intensities, a thresholding by 
combining the calculated background surface with the original image and finally a 
post-processing technique in order to eliminate noise pixels, improve the quality of 
text regions and preserve stroke connectivity. Text areas are located if the distance of
the original image from the calculated background is over a threshold. This threshold
adapts to the gray-scale value of the background surface in order to preserve textual
information even in very dark background areas. The proposed methodology works
with great success even in cases of historical manuscripts with poor quality, shadows,
nonuniform illumination, low contrast, large signal-dependent noise, smear and strain.
Experimental results show that our algorithm outperforms the best-known
thresholding approaches.
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Fig. 10. Experimental results - binarization of a handwritten manuscript: (a) Original image; 
(b) Otsu’s method; (c) Niblack’s method; (d) Sauvola’s method; (e) The proposed method 
(including post-processing). 
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Abstract. The historical documents are valuable cultural heritages and 
sources for the study of the history, society and daily life of their time. The
digitalization of historical documents aims to give researchers and the
public instant access to archives that they have had little chance to consult
for preservation reasons. However, most of these
documents are not only written by hand in ancient Chinese characters,
but also have complex page layouts. As a result, conventional OCR (optical
character recognition) systems are not easy to apply to historical
documents, even though OCR has for several years received the most attention
as a key module in digitalization. We have been developing an OCR-based
digitalization system for historical documents for years. In this paper,
we propose dedicated segmentation and rejection methods for OCR of
Korean historical documents. The proposed recognition-based segmentation
method uses geometric features and context information with the Viterbi
algorithm. The rejection method uses the Mahalanobis distance and posterior
probability, in particular to solve the out-of-class problem. Some promising
experimental results are reported.



1 Introduction 

Korean national agencies preserve the archives of the Lee Dynasty, which contain
over 2 million books with more than 400 million pages in printed format.
However, it is widely accepted that the best way to preserve valuable documents is
to reduce how often they are physically accessed. As a result, preservation
limits access to those documents. Many people therefore
believe that building a digital library of historical documents can expand
accessibility.

As digitalization is expected to make historical documents much easier to use,
several institutions have been digitalizing historical documents for
several years. For instance, we took part in a digitalization effort that
took 7 years of manual key-in for the Korean Canon of Buddhist, which is made
up of wooden printing blocks carved by hand and consists of 160,000 pages and
56,000,000 characters. Fig. 1 shows an example of the Korean Canon
of Buddhist. In this case, however, the digital archive was constructed by a
completely manual approach.

Fig. 1. An Example of Korean Canon of Buddhist.

In Korea, digitalization of historical documents has mostly been performed by manual
typing, because most historical documents were handwritten in
ancient Chinese characters, which we call Hanja. Ancient characters that are
no longer used in contemporary texts also make up a considerable proportion.

Fig. 2. Technical challenges for historical character images: (a) variants; (b) blurred characters.



Recently, OCR techniques have been attracting attention because the latest
of them show high performance on modern printed materials. However,
digitalizing Hanja documents with OCR alone is not easy. The
main difficulty comes from shape variation due to writers’ habits, styles, and so
on (Fig. 2(a)). Blurred characters also appear often in the documents
because of the complex structure of the strokes and the brush ink (Fig. 2(b)). Because
these factors degrade the performance of character recognition, a perfect
OCR output cannot be expected, and consequently OCR cannot simply
substitute for manual typing. The usefulness of OCR for historical documents
has also been very restricted [1][2]. We have therefore developed an OCR-based system
to enhance the overall efficiency of the digitalization process [3] (Fig. 3). This system
was designed to combine typing and OCR so that each compensates for the
other’s drawbacks.

A brief description of our overall system is as follows. In the segmentation
step, several hundred scanned document pages are rapidly segmented
into individual characters. The OCR module is then called to classify
each of the segmented characters; that is, each segmented character image is
fed to a recognizer to obtain a label. Segmentation and classification are
repeated until all documents are processed. A notable characteristic of the
proposed system is that it considers rejection whenever a character image is
recognized: the system rejects a character if the OCR confidence is not high enough. After






Min Soo Kim et al. 




Fig. 3. Our overall digitalizing scheme of Korean historical documents.



the classification and the rejection are done, the characters with the same class
label are collected into one of the predefined character groups, and the system shows
them to operators to verify the result. Whenever the operators find a misclassified
character, they can remove it. Because each group is sorted, the characters that are
least similar to the rest of the group are found at the back of the group (Fig. 4).




Fig. 4. User interface for visual inspection (grouping of characters with the same label).



Therefore, operators only have to inspect the back part of each character group
for misclassified characters. When the verification is over, the system
automatically assigns the label to each character in the group. The provided user
interface for visual inspection thus saves a great deal of laborious typing effort.
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This paper is organized into 4 sections. Section one is this Introduction in 
which we describe our overall digitalizing system and the situation of Korean
historical document digitalization. Sections two and three explain the proposed
segmentation method and rejection method, respectively, which play a critical role
in our OCR process. Section four presents the experimental results and
evaluations. Finally, section five gives the conclusion and prospects for future work.



2 Proposed Recognition-Based Segmentation Method 

Approaches to Hanja segmentation can be classified into recognition-based methods
and image-based methods. Because image-based methods are highly dependent on
character shape, and handwritten characters vary widely, such techniques
are limited in their performance on handwritten characters.
Automating the digitalization of historical documents requires high segmentation
accuracy. Therefore, image-based segmentation methods are not
suitable for historical documents.

We propose to use recognition-based methods because they are effective in
dealing with handwritten characters. However, recognition-based
Hanja segmentation methods are known to have some problems [6][7]: (1) An
out-of-class image with an incomplete shape is matched to a character defined by the
recognition model. (2) It is time consuming to evaluate many candidates for merging
character fragments. (3) Each fragment of a character can be misclassified as an
individual character (Fig. 5).




Fig. 5. Problems of recognition-based Hanja segmentation (a) Recognition of Out-of- 
class (b) Overhead due to recognition of many candidates (c) Misclassification of each 
fragment in a character into individual character. 

In order to solve the problems associated with recognition-based segmentation
methods, two additional criteria are applied: character geometric features and
context information. Character geometric features help to reduce the number of
candidates for recognition, which decreases both the chance of recognizing an
out-of-class image and the time needed for recognition. Context information
reduces the chance of misclassifying a character fragment as an individual
character.

Because recognition-based methods employ a split-merge strategy in which
the split segments are merged into characters, a pre-segmentation stage is needed
to separate the text string image into segments. Each segment must belong
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Fig. 6. Pre-segmentation by nonlinear segmentation path. 



to just one character. The nonlinear segmentation path is a splitting method that can
separate overlapping characters and touching characters [8].

To generate nonlinear segmentation paths, a character string is regarded
as a multi-stage directed graph. An observation score for each pixel and a transition
score from a left pixel to a right pixel are defined in advance. The fewer
black pixels a segmentation path passes through and the straighter it is, the higher
the score it gets. Some possible paths from the left end to the right end of the
character string are selected by using the Viterbi algorithm. As shown in Fig. 6, a
nonlinear segmentation path can separate overlapping parts and touching parts. A
segmentation graph is then constructed, with candidate paths as nodes and
combining scores as arcs. The posterior probability of the geometric feature and
character string given the character image is used as the merging score. If the number
of characters in a character string is k, the combining candidates made
up of split components are denoted by X = x_1, x_2, ..., x_k, the character string by
S = s_1, s_2, ..., s_k, and the character geometric features by G = g_1, g_2, ..., g_k. The
posterior probability P(G, S|X) can be derived as follows:

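$$\begin{aligned}
\hat{S} &= \arg\max_S P(G, S \mid X) = \arg\max_S P(G, S, X) \\
&\approx \arg\max_S P(G)^{\lambda_G}\,P(S, X) = \arg\max_S P(G)^{\lambda_G}\,P(S)\,P(X \mid S) \\
&= \arg\max_S \; \lambda_G \log P(G) + \log P(S) + \log P(X \mid S)
\end{aligned}$$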

In the above equation, λ_G is a correction weight, needed because the geometric feature G
depends on the character string and the character image. The final expression is the sum
of the character geometric score P(G), the context information score P(S) and the
recognition score P(X|S), each in logarithmic form. The constant λ_G is determined by
training. The geometric score P(G) is defined as follows.



$$P(G) = P(CH)^{\lambda_c}\,P(SQU)^{\lambda_s}\,P(GAP)^{\lambda_g} \qquad (1)$$

In (1), CH, SQU and GAP denote the character height, squareness and internal
gap, respectively. The three constants λ_c, λ_s, λ_g are set to 1, 1, 3,
respectively. The distributions of CH, SQU and GAP are estimated by the Parzen window
method. If P(G) is 0, the merging candidate is eliminated. The context information
score is computed as the bi-gram language model score between the previous
and current character labels in the character string. The recognition score is the
distance between the recognition model and the merging candidate image. Finally, the
segmentation graph is generated as shown in Fig. 7. In Fig. 7, a merging candidate
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Fig. 7. Optimal segmentation path finding. 




Fig. 8. Examples of segmented images by proposed method. 

image without a combining score is one that has been eliminated by its geometric score.
This elimination reduces the recognition overhead. Some images segmented by
searching for the optimal path are shown in Fig. 8.
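
The search for a nonlinear segmentation path described earlier in this section can be illustrated with a small Viterbi-style dynamic program. The sketch below is not the authors' implementation: it assumes a binary image (1 = black), a path that occupies one pixel per column and may move at most one row between neighbouring columns, an observation cost of 1 per black pixel, and a hypothetical `bend_penalty` for every row change. Minimizing this cost corresponds to maximizing a score that favours white, straight paths.

```python
def segmentation_path(img, bend_penalty=0.5):
    """Return (cost, rows), where rows[x] is the row of the path in column x."""
    h, w = len(img), len(img[0])
    best = [float(img[y][0]) for y in range(h)]   # cost of paths ending at column 0
    back = []                                      # back[x - 1][y] = row in column x - 1
    for x in range(1, w):
        new, ptr = [], []
        for y in range(h):
            # Transition: staying on the row is free, moving one row is penalized.
            cands = [(best[p] + (bend_penalty if p != y else 0.0), p)
                     for p in (y - 1, y, y + 1) if 0 <= p < h]
            c, p = min(cands)
            new.append(c + img[y][x])              # observation: black pixels cost 1
            ptr.append(p)
        best = new
        back.append(ptr)
    y = min(range(h), key=best.__getitem__)
    total, rows = best[y], [y]
    for ptr in reversed(back):                     # trace the best path backwards
        y = ptr[y]
        rows.append(y)
    return total, rows[::-1]

# Two overlapping dark regions separated only by a bent white channel.
img = [[1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1],
       [1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1]]
print(segmentation_path(img))
```

A straight cut along row 1 or row 2 would cross black pixels, whereas the bent path follows the white channel and pays only for a single change of row.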

3 Recognition with Rejection-Option 
by Linear Discriminant Analysis 

LDA (linear discriminant analysis) is a classification method that uses a discriminant
function based on the Mahalanobis distance, under the assumption that the features are
normal random variables and that all classes share the same covariance. We can then
calculate the conditional probability of x given w_i:

$$p(x \mid w_i) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}\,(x - \mu_i)^T \Sigma^{-1} (x - \mu_i)\right]$$

We allocate a new observation to the population for which P(w_i|x) is largest.
Namely, we classify x to w_i if P(w_i|x) > P(w_j|x) for all j ≠ i, where

$$P(w_i \mid x) = \frac{p(x \mid w_i)\,p(w_i)}{\sum_k p(x \mid w_k)\,p(w_k)}$$
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Because Bayes’s classification rule depends on posterior probability p{wi\x), it 
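is proportional to log p(x|π_i) + log p(π_i):

$$\begin{aligned}
p(\pi_i \mid x) &\propto \log p(x \mid \pi_i) + \log p(\pi_i) \\
&= -\tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}\,(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \log p(\pi_i)
\end{aligned}$$

Since $x^T \Sigma^{-1} x$ and $\log|\Sigma|$ are common to all classes, this results in

$$p(\pi_i \mid x) \propto 2\,\mu_i^T \Sigma^{-1} x - \mu_i^T \Sigma^{-1} \mu_i + 2 \log p(\pi_i)$$

If we replace the parameters by their MLEs (maximum likelihood estimators) $\hat{\mu}_i$ and S, we
obtain the LDF (linear discriminant function):

$$d_i(x) = \hat{\mu}_i^T S^{-1} x - \tfrac{1}{2}\,\hat{\mu}_i^T S^{-1} \hat{\mu}_i + \log p(\pi_i)$$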

In historical character classification with OCR, a very important problem is
how many character classes to include in the recognition set. Not all Hanja
classes are needed to recognize the documents, because
many characters rarely appear in common use and including them greatly increases the
computational complexity of the recognition. We statistically investigated
about 3,896 pages of ancient Korean documents, containing about 1.5 million
characters over 5,599 Hanja classes, from Seungjungwon Diary vol. 29. From the
investigation, we observed that about 2,500 Hanja classes were frequently used,
but about 3,600 Hanja classes were rarely used. Thus, we determined that the 2,568
Hanja classes that account for about 99% of the characters in the documents should be
considered in the recognition step (refer to Fig. 9). This is one reason we propose
the rejection system. Another is that detecting and correcting misclassified
characters costs more than manual typing.
In this paper, we propose a rejection method that combines a maximum posterior
probability threshold, which removes ambiguous characters, with a Mahalanobis
distance threshold, which throws out outliers. Fig. 11 shows results of
recognition with rejection. If we assume that x|w_i represents random samples from a




Fig. 9. The number of character classes to be considered, obtained from frequency analysis.
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Fig. 10. Boundaries of two rejection rules. 
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Fig. 11. Rejection-used recognition results. 



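multivariate normal distribution with mean vector μ_i and covariance matrix Σ,
P(w_i|x) can be calculated as

$$P(w_i \mid x) = \frac{\exp\left(-\tfrac{1}{2}\,v_i\right)}{\sum_{k} \exp\left(-\tfrac{1}{2}\,v_k\right)}$$

where $v_k = (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)$ is the (squared) Mahalanobis distance from x to
class k. The system rejects x if the maximum posterior probability is less than a threshold θ₁ or
the shortest Mahalanobis distance is larger than a threshold θ₂ (refer to Fig. 10):

Reject x, if max_j P(w_j|x) < θ₁ or min_j v_j > θ₂
Accept x, otherwise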
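
The two-threshold rejection rule can be sketched as follows. This is a minimal illustration rather than the system's classifier: it assumes equal class priors, a shared inverse covariance matrix supplied by the caller, and purely illustrative values for the thresholds θ₁ and θ₂; the function names are hypothetical.

```python
import math

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance (x - mu)^T cov_inv (x - mu)."""
    d = [a - b for a, b in zip(x, mu)]
    return sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

def classify_with_rejection(x, means, cov_inv, theta1, theta2):
    """Return the index of the winning class, or None when x is rejected."""
    v = [mahalanobis_sq(x, mu, cov_inv) for mu in means]
    # Subtracting min(v) keeps the exponentials well scaled; it cancels in the ratio.
    w = [math.exp(-0.5 * (vi - min(v))) for vi in v]
    post = [wi / sum(w) for wi in w]
    best = max(range(len(means)), key=post.__getitem__)
    if post[best] < theta1:      # ambiguous: no class is clearly the most probable
        return None
    if min(v) > theta2:          # outlier: far from every class mean
        return None
    return best

means = [(0.0, 0.0), (4.0, 0.0)]
identity = [[1.0, 0.0], [0.0, 1.0]]
for x in [(0.2, 0.0), (2.0, 0.0), (-6.0, 0.0), (3.9, 0.1)]:
    print(x, classify_with_rejection(x, means, identity, theta1=0.9, theta2=9.0))
```

The point halfway between the two means is rejected by the posterior threshold, while the point far to the left has a posterior close to 1 for the first class and is rejected only by the distance threshold, which is why the two rules are used together.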
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4 Experimental Results and Evaluations 

We evaluated the effectiveness of the proposed methods using Seungjungwon
Diary, an ancient Korean government document written by many writers over
nearly 500 years. To show the performance of the proposed segmentation method,
we used 200 historical document pages of Seungjungwon Diary vol. 29 that contain
78,756 handwritten characters. As shown in Fig. 12, the performance
of the proposed segmentation method is nearly the same as that of manual
segmentation. The recognition rates with manual and proposed segmentation
were 92.99% and 92.98%, respectively. Relative to the manual segmentation
result, the proposed method achieved a performance of 99.98%.
Fig. 12. Comparison of recognition between manual segmentation and proposed
segmentation method.



Next, we carried out an experiment to compare the precision of the classifiers
at different rejection rates. 100 characters per class extracted from
Seungjungwon Diary vol. 29 were used for training, and 200 pages extracted from
Seungjungwon Diary vol. 29 were used for testing. E-classifier and M-classifier
denote classifiers based on Euclidean distance and Mahalanobis distance,
respectively, and d-thr and p-thr denote the Mahalanobis distance threshold and
the posterior probability threshold, respectively. As can be seen in Fig. 13,
the M-classifier with the d-thr and p-thr rejecter is, on the whole, superior
to the others.

As shown in Fig. 13, when the precision (recognition rate) of accepted
characters is 98%, the percentage of rejected characters is 12.68%. Based on
this result, Table 1 shows the economic effectiveness of the proposed system.
We compared the input cost of manual typing with that of the proposed system.
Suppose that the input cost and the reform (correction) cost are 10 and 30 per
character, respectively, and that we have 10,000,000 characters. When the
ratio of rejected characters is 12.68%, the total input cost of the proposed
system is 18,680,000, whereas the total cost of the manual-typing-only method
is 100,000,000.
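
The cost comparison can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the character count of 10,000,000 and the count of 200,000 corrected characters follow the figures printed in Table 1 (the latter corresponds to 2% of all characters).

```python
def manual_cost(n_chars, input_cost=10):
    """Cost of typing every character by hand."""
    return n_chars * input_cost

def proposed_cost(n_rejected, n_misrecognized, input_cost=10, reform_cost=30):
    """Rejected characters are typed by hand; misrecognized ones are corrected."""
    return n_rejected * input_cost + n_misrecognized * reform_cost

N_CHARS = 10_000_000
n_rejected = round(N_CHARS * 0.1268)   # 12.68% of the characters are rejected
n_misrecognized = 200_000              # count used in Table 1 for precision 0.98

print(manual_cost(N_CHARS))                        # 100000000
print(proposed_cost(n_rejected, n_misrecognized))  # 18680000
```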

We also calculated the time cost of the proposed system. As can be seen in
Fig. 14, the digitalization of 10,000,000 characters requires 1000 man-days
with the manual typing method, but only 144 man-days with the proposed method.
Fig. 13. Comparison of precisions according to rejection rates.

Table 1. Total cost comparison of manual typing only method and proposed method.

Input Type                  Total Cost    Input Cost (10 per char)         Reform Cost (30 per char)
Manual Typing Only          100,000,000   10,000,000 × 10 = 100,000,000    0
Proposed System             18,680,000    1,268,000 × 10 = 12,680,000      200,000 × 30 = 6,000,000
(Case of precision 0.98)


Fig. 14. Time-cost comparison of manual typing method and proposed method
(axis: man-days).



Assumptions:

- a. 1 man-day: number of characters that can be entered by manual typing in
  one day = 10,000 characters
- b. Number of historical characters to be digitalized = 10,000,000 characters
- c. Accepted ratio = 0.8732
- d. Precision = 0.98

Yield:

- Time cost of manual typing method: b/a = 1000 man-days
- Time cost of proposed method: ((b × c × (1 - d)) + (b × (1 - c))) / a = 144
  man-days
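
As a check of the two yields above, the following sketch evaluates both expressions with the stated assumptions a to d. The function and parameter names are hypothetical.

```python
def time_cost_manual(total_chars, chars_per_day):
    """Man-days needed when every character is typed by hand (b / a)."""
    return total_chars / chars_per_day

def time_cost_proposed(total_chars, chars_per_day, accepted_ratio, precision):
    """Man-days needed to retype misrecognized and rejected characters."""
    misrecognized = total_chars * accepted_ratio * (1 - precision)  # b * c * (1 - d)
    rejected = total_chars * (1 - accepted_ratio)                   # b * (1 - c)
    return (misrecognized + rejected) / chars_per_day

a, b, c, d = 10_000, 10_000_000, 0.8732, 0.98
print(time_cost_manual(b, a))                  # 1000.0
print(round(time_cost_proposed(b, a, c, d)))   # 144
```

The unrounded value of the second expression is about 144.26 man-days, which the text reports as 144.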
5 Conclusions

Purely OCR-based digitalization of handwritten historical documents is a difficult
problem because of complex layouts, blurred characters, shape variations due to
writers’ habits and styles, and so on. We have been developing a full-fledged
system that combines manual typing with OCR to digitalize historical materials.
In the system, a huge number of documents are segmented at once by the proposed
segmentation method, individual characters are identified by the OCR module,
and characters with the same class label are collected into one of the
predefined character groups. After the grouping with OCR, an operator can
verify the correctness of the classifications and finally input a text code for
each group, instead of typing all the characters. Character segmentation has
long been a critical area of the OCR process. In this paper, we proposed a
dedicated segmentation method that uses geometric features and context
information with the Viterbi algorithm. We also designed a posterior-based
rejection method that uses the Mahalanobis distance. According to the
experimental results, the proposed methods helped enhance the overall
efficiency of the process and reduce the costs.
Abstract. This paper presents how Self-Organizing Maps, and especially Kohonen
maps, can be applied to digital images of ancient collections with a view to
their valorization and diffusion. As an illustration, a scheme for transparency
reduction of the digitized Gutenberg Bible is presented. In this two-step
method, the Kohonen map is trained to generate a set of test vectors that are
then used to train, in a supervised manner, a classical feed-forward network.
The testing step then consists in classifying each pixel into one class out of
four by feeding it directly to the feed-forward network. The pixels belonging
to the transparency class are then removed.



1 Introduction 

To assure longer conservation and worldwide diffusion, libraries have launched vast
programs of digitalization of their collections [1].

One of the most emblematic documents of human history, the Gutenberg Bible, which
started the history of printing in the Western world, has been available worldwide
through the World Wide Web for a few years. In the case of an ancient and rare
document such as the Gutenberg Bible, digitalization aims to provide wider access to
material that needs to be protected from too frequent handling, with a view to
worldwide diffusion, but also improved conditions for scholarly studies, for
instance the comparison of two copies. The digital images can be compared more easily
than the physical books themselves. The possibility of comparing two copies is
valuable because no two copies of the Gutenberg Bible are exactly identical [2].

After digitalization, no matter how well this stage has been done [3], it is necessary
or desirable to apply a wide variety of image processing techniques to valorize the
digital document. Among others, image restoration, compression, segmentation and
then character recognition may be envisaged for a wide variety of applications. For
this kind of task, Artificial Neural Networks (ANN) have already proven their
abilities [4].

In this paper, we propose to apply Self-Organizing Maps (SOM) to the digitized
Gutenberg Bible. Four complete vellum copies remain throughout the world, one
being conserved in Göttingen. The Niedersächsische Staats- und Universitätsbibliothek
Göttingen has completed its digitalization and the whole Bible (1282 pages) has been
available on a web site* since June 2000. However, probably due to the lighting
needed for digitalization, the pages present a significant show-through effect. The
reverse page information appears on the right-hand image and degrades the general
aspect and sometimes the readability of the text.

With a view to the valorization of ancient collections, image restoration is an
important topic. Among others, transparency reduction may be of interest for the
worldwide diffusion and valorization of this emblematic document. After transparency
reduction, one may see facsimiles looking much more like the original pages would
appear in good lighting conditions in the Göttingen University library.



2 Color Image Classification Using Self-organizing Maps 

Image color clustering by means of self-organizing maps has been proposed in various
works to achieve different goals, for instance segmentation and compression [5-10].
This section is a very brief reminder of self-organizing maps, also known as
Kohonen maps, applied to the particular case where the input vector is formed by
the three RGB components of one image pixel. Complete developments may be found
in [11,12].

The competitive layer consists of S neurons (S being chosen by the user depending
on the problem to solve). A three-component weight vector is associated with each
neuron and represents the RGB components of this neuron. The competitive layer
computes the distances between the current pixel and each neuron. The neuron that is
closest to the current pixel wins and is the only one whose weight vector is modified
according to the Kohonen learning rule. Thus, the neuron whose weight vector was
closest to the current pixel is updated to be even closer. The result is that the
winning neuron is more likely to win the competition if a similar pixel is presented
later, and less likely to win when a very different pixel is presented. As more and
more pixels are presented, each neuron in the layer adjusts its weights towards a
group of pixels. Eventually, if there are enough neurons, every cluster of similar
pixels will have a neuron that outputs 1 when a pixel in the cluster is presented,
while outputting 0 at all other times. Thus, the competitive network learns to
categorize the pixels of the given image.
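
The winner-take-all behaviour described above can be sketched as follows. This is only an illustration under assumed settings: the learning rate, the number of epochs, the initialisation from randomly chosen pixels and the function names are hypothetical choices, not the parameters of our experiments.

```python
import numpy as np

def train_competitive_layer(pixels, n_neurons, epochs=10, lr=0.1, seed=0):
    """Winner-take-all training: only the closest neuron moves towards each pixel."""
    rng = np.random.default_rng(seed)
    pixels = np.asarray(pixels, dtype=float)
    # Assumed initialisation: start each neuron on a randomly chosen pixel.
    start = rng.choice(len(pixels), size=n_neurons, replace=False)
    weights = pixels[start].copy()
    for _ in range(epochs):
        for x in pixels[rng.permutation(len(pixels))]:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Kohonen rule: move the winner a fraction lr of the way towards x.
            weights[winner] += lr * (x - weights[winner])
    return weights

def winner_index(weights, x):
    """Index of the neuron that outputs 1 for pixel x (all others output 0)."""
    return int(np.argmin(np.linalg.norm(weights - np.asarray(x, dtype=float), axis=1)))
```

Trained with two neurons on a mixture of dark and light RGB pixels, the layer assigns one neuron to each group, so that `winner_index` returns a different index for a dark pixel and for a light one.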

Self-organizing feature maps (SOM or SOFM) learn to classify input vectors
according to how they are grouped in the input space. They differ from competitive
layers in that neighboring neurons in the self-organizing map learn to recognize
neighboring sections of the input space. Thus, self-organizing maps learn both the
distribution (as do competitive layers) and the topology of the input vectors they
are trained on. The neurons in the layer of a SOFM are arranged originally in
physical positions according to a topology function. As for the competitive layer,
the SOFM computes the distance between the input pixel and each neuron of the map.
The winning neuron then updates its weights and, in addition, its neighbours update
their weights too. This feature of the SOFM is responsible for the network’s ability
to learn the topology of the input space.
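
The difference with the competitive layer, namely the update of the neighbours of the winner, can be sketched as below. The Gaussian neighbourhood function, the linear decay of the learning rate and of the neighbourhood radius, and all names and default values are assumptions made for this illustration; they are not specified in this paper.

```python
import numpy as np

def train_som(pixels, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rectangular SOM on RGB pixels; returns weights of shape (rows, cols, 3)."""
    rng = np.random.default_rng(seed)
    pixels = np.asarray(pixels, dtype=float)
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Physical position of each neuron on the rectangular grid.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Assumed initialisation: uniform inside the bounding box of the data.
    weights = rng.uniform(pixels.min(axis=0), pixels.max(axis=0),
                          size=(rows * cols, pixels.shape[1]))
    n_steps = epochs * len(pixels)
    step = 0
    for _ in range(epochs):
        for x in pixels[rng.permutation(len(pixels))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                    # learning rate decays to 0
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # neighbourhood shrinks
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((grid - grid[winner]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))  # 1 at the winner, less nearby
            # Winner and neighbours all move towards x, weighted by h.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights.reshape(rows, cols, -1)
```

Because every update is a convex combination of a weight vector and a pixel, the trained weights stay inside the range of colors present in the sample.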

Whereas both the competitive layer and the SOFM use unsupervised learning,
Learning Vector Quantization (LVQ) networks use supervised learning. LVQ networks
classify input vectors into target classes by using a competitive layer to find
subclasses of input vectors, and then combining them into the target classes by
means of a linear network trained for instance with backpropagation. LVQ networks
can classify any set of input vectors, not just linearly separable sets of input
vectors. The only requirement is that the competitive layer must have enough
neurons, and each class must be assigned enough competitive neurons. To ensure that
each class is assigned an appropriate number of competitive neurons, it is important
that the target vectors used to initialize the LVQ network have the same distribution
of targets as the training data the network is trained on. If this is done, target
classes with more vectors will be the union of more subclasses.



3 Application on the Digitized Gutenberg Bible 

3.1 General Principle 

The general principle consists in classifying each pixel of the page to be processed
into one class out of four: background, right-hand text, colored letters and
transparency. This classification then leads to practical applications such as
transparency reduction or compression, which will be described in the following
paragraphs. Note that we have used RGB components in our work. However, it might be
interesting to apply the same ideas with other color spaces such as HSV, YIQ or
CIE L*a*b*, which could be better suited.

The method consists first in selecting a characteristic sample of the page to be
processed. The sample must not be too large (in number of pixels), to avoid
prohibitive computation time. It must also include the main features of the
collection, such as colored characters, transparency and visible writing. Figure 1
shows an extract of one page of the Gutenberg Bible (folio 1 Ir of the Genesis Book
in volume 1). A small window including the reference letter in the upper left corner
of the image is used to establish the Kohonen map.

Figure 2 shows the pixel distribution of the preceding sample in the RGB space.
Pixels near the origin represent the right-hand text pixels (close to black) on the
right-hand page; the cloud in the upper right corner represents background pixels. A
big cloud with a strong blue component, just above the one close to the origin,
represents the pixels of the blue reference letter. Finally, a small diffuse cloud
with a strong red component corresponds to the few pixels of the few red-colored
letters in the right-hand text.
Fig. 1. Extract of the page to be processed (folio 1 Ir of the Genesis Book in Vol. 1). The upper
left square represents the training set of the Kohonen map.




Fig. 2. Pixel distribution of the image sample corresponding to figure 1 in the RGB space. 

Figure 3 shows the Kohonen map obtained after training on the small window. It
consists of a rectangular grid topology of 7 by 7 neurons trained using the Euclidean
distance for 50 epochs. The principal characteristics of SOMs appear in this
figure: the neuron density depends directly on the number of pixels in the
corresponding zone of the input space, and the topology is clearly preserved. Note
that the 7*7 dimension of the Kohonen map was chosen only for clarity of
visualization; it is not the one used in the final experiment.
Fig. 3. Kohonen map obtained after a 50-epoch training fed with the pixel distribution of
figure 2. The neuron layer has a rectangular topology and is 7 by 7 neurons wide.



3.2 Transparency Reduction 

As a first idea for applying self-organizing maps to ancient collections, we propose
to reduce the show-through effect that degrades the general aspect and the
readability of the digitized pages of the copy of the Gutenberg Bible held in
Göttingen. One may object that it should be quite easy to find a few rules on the RGB
components to remove or at least reduce this transparency (for instance, transparency
pixels have low contrast and their RGB components should not differ much from one
another). These few rules, applied pixel by pixel to the whole page, would probably
achieve a good result. This is not false. In a similar way, other works have produced
good algorithms to reduce transparency. A number of methods mitigate show-through
effects through digital image processing techniques. Naive methods are based
on extensions of binarization techniques for gray-scale document images [13,14,15].
To overcome the limitations of the naive methods, more sophisticated methods have
been developed on the assumption that images of both sides of the paper sheet are
acquired and processed so that the front side image can be compared with the back
side image [16]. Correspondence between the two images needs to be established with
pixel precision in some way in order to estimate the transmission coefficient of the
paper sheet. However, it is difficult to apply this approach with ordinary input
devices because solving the correspondence problem is intractable due to nonlinear
shape distortion of page images. That is why more sophisticated methods have since
been proposed [17,18].

However, one should not forget that color clustering may lead to other results
beyond transparency reduction (for instance, image compression). One should then
consider this particular application (even if it may have some interest in itself)
as an illustration of self-organizing maps applied to color images.

It is possible to classify input pixels into four different classes: background,
right-hand text, colored letters (like reference letters) and reverse page pixels
showing through. This last class is then removed from the original page and replaced
by the background color (the average of the last eight pixels detected as background
pixels). Our first approach was to use the SOM followed by the K-means algorithm,
as done in [19]. This method gives good results with the Gutenberg Bible.
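
The replacement step can be sketched as follows, assuming that a label map with one class per pixel is already available. The raster scan order, the numeric class codes and the choice to leave a transparency pixel unchanged when no background pixel has been seen yet are assumptions of this illustration.

```python
from collections import deque

import numpy as np

# Hypothetical numeric codes for the four classes.
BACKGROUND, TEXT, COLORED, TRANSPARENCY = 0, 1, 2, 3

def remove_transparency(image, labels, history=8):
    """Replace transparency pixels by the mean of the last `history` background pixels."""
    out = image.astype(float).copy()
    recent = deque(maxlen=history)  # most recent background colors, oldest dropped first
    height, width = labels.shape
    for y in range(height):         # raster scan: row by row, left to right
        for x in range(width):
            if labels[y, x] == BACKGROUND:
                recent.append(image[y, x].astype(float))
            elif labels[y, x] == TRANSPARENCY and recent:
                out[y, x] = np.mean(list(recent), axis=0)
    return np.rint(out).astype(image.dtype)
```

Pixels labelled as text or colored letters are never modified, so any trimming of characters can only come from the classification itself.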

However, as the K-means algorithm is not reliable in the case of non-separable
classes (colored letters are not linearly separable, as can be seen in Fig. 2), we
have chosen to use a SOM and a supervised neural network.

To achieve this classification, we have used a feed-forward network (FFN) trained by
back-propagation of the error, with one hidden layer of 25 neurons (a hidden layer
is needed because the four classes are not linearly separable). To train this
network, which requires supervised learning, we have used a set of 100 RGB vectors
given by the weight vectors of a Kohonen map of 10 by 10 neurons trained on the
image sample corresponding to figure 1. As this is a supervised learning method, it
is necessary to associate with each element of this set of 100 vectors the
corresponding target, that is, the class in which it has to be put.

Our method has two steps. The first one consists in training the FFN using the
output of the SOM, which has been trained on the small window of the chosen image.
The testing step consists in feeding the FFN with the pixels of the image to be
processed.

Thus, the SOM is only used for the purpose of training. In fact, the SOM has the
ability to reduce the dimension of the input data (the small window chosen for
training). This means that the weights of the neurons obtained after training the
SOM represent the most characteristic colors of the input sample. The FFN is then
trained on an optimal set of colors.

As the chosen sample is representative of the collection to be processed, we can
now test our method on a variety of images from this collection. As shown in figure 4,
which illustrates the principle of our method, the testing data is given directly to
the FFN.

Note that the first network, being an unsupervised learning method, does not require
any target to be associated with the image sample; the size of this image sample is
then limited only by the computation power of the workstation. On the other hand,
the second network, being a supervised learning method, requires each test vector to
be associated with its corresponding target. That is why we have chosen to use a set
of 100 test vectors, a number that remains reasonable for entering the corresponding
targets by hand.
Fig. 4. Principle of the method of pixel classification using SOM and FFN in the training stage
and FFN in the testing stage. The testing set can also be another image having the same
characteristics as the one from which we have extracted the training data. Note that this figure
is directly related to the Gutenberg Bible. For other books or collections, the number of classes
in the output of the FFN can be different, but the general principle will be exactly the same.

The number of neurons is guided by two constraints. The first requirement is to
have a sufficiently large number of neurons compared to the number of final
classes. For instance, a 2*2 SOM will classify the input image into 4 classes, which
would make the FFN pointless. Our experiments show that, starting from a map of 5*5
neurons, we can separate the different classes. On the other hand, having a very
large map would make the “neurons-classes” association heavy without improving the
results in a significant way.

Figure 5 shows the result of the transparency reduction applied to the extract of
figure 1. Almost 15% of the total number of pixels has been detected as transparency
and replaced. This rate may seem too large, but it is due to the fact that, during
the feed-forward network training, it was considered better to mistake a background
pixel for a transparency one (the pixel being replaced by a background one anyway)
than to miss a transparency pixel that would then not be removed.

From a qualitative point of view, figure 5 shows that the transparency reduction has
been efficient: the text is easier to read and the image much more pleasant to look
at. However, when one looks carefully at the obtained image, one can see a) remaining
dark pixels corresponding to transparency pixels that were too close to right-hand
text to be correctly classified, and b) a character trimming effect: some characters
have lost some pixels because they have been wrongly removed, their RGB components
being close to the transparency class.

To judge the validity of our approach, we have tested it on other images, different
from the one from which we have extracted our training data.
Fig. 5. Result of the transparency reduction applied on the extract of figure 1.
Fig. 6. Test figure. We can see that the transparency is higher in this example.
Fig. 7. Result of transparency reduction applied on the extract of Fig. 6.



The result has the same characteristics as the first one. Most of the transparency
has disappeared and we obtain the desired front writing.

Whether simple schemes or more complicated ones such as classification by ANN are
used, as long as the classification is based only on the RGB components of the
current pixel, the character trimming effect described above cannot be avoided. A
pixel of right-hand text may have RGB components corresponding to a light gray level
and then be mistaken for a transparency pixel and, similarly, a transparency pixel
may be quite dark and be confused with right-hand text. During training, a trade-off
has to be found between these two contradictory effects, but its efficiency will
always be limited. It is also possible to apply rules after classification and before
taking the removal decision; for instance, to limit the trimming effect, even though
a pixel has been detected as a transparency pixel, it is not replaced if its closest
neighbors have been detected as right-hand text.
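
Such a post-classification rule can be sketched as below. The text does not fix how many neighbors must be right-hand text, so the threshold `min_text_neighbours`, the numeric class codes and the function name are hypothetical.

```python
import numpy as np

def removal_mask(labels, text_label=1, transparency_label=3, min_text_neighbours=1):
    """True where a transparency pixel should actually be replaced.

    A transparency pixel is kept (not replaced) when at least
    `min_text_neighbours` of its 8 closest neighbours are right-hand text.
    """
    height, width = labels.shape
    # Pad with False so that border pixels simply have fewer neighbours.
    is_text = np.pad(labels == text_label, 1, constant_values=False)
    text_count = np.zeros((height, width), dtype=int)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue  # the pixel itself is not one of its neighbours
            text_count += is_text[1 + dy:1 + dy + height, 1 + dx:1 + dx + width]
    return (labels == transparency_label) & (text_count < min_text_neighbours)
```

Raising the threshold makes the rule less protective of character edges and removes more show-through pixels near the text.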

However, a further idea would consist in using once again a SOM and a feed-forward
network whose input vectors would contain not only the RGB components of the
single current pixel but also the RGB components of its 8 closest neighbors. Thus, the
additional rules for the removal decision mentioned above would be included in the
architecture of the neural network.
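
A possible way of building such input vectors is sketched below: each pixel is described by the 27 RGB components of its 3 by 3 neighborhood. Replicating the border pixels at the edges of the image, as well as the function name, are assumptions of this illustration.

```python
import numpy as np

def neighbourhood_features(image):
    """Return one row per pixel holding the RGB values of its 3x3 neighbourhood."""
    height, width, channels = image.shape
    # Assumed border handling: replicate the edge pixels.
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    shifted = [padded[dy:dy + height, dx:dx + width]
               for dy in range(3) for dx in range(3)]
    # Neighbours are ordered row by row, so the current pixel is the 5th of the 9.
    return np.concatenate(shifted, axis=2).reshape(height * width, 9 * channels)
```

For an RGB image, columns 12 to 14 of each row hold the current pixel itself, so the vectors used in this paper are a subset of these extended vectors.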



4 Perspectives 

Our method has to be tested on a variety of document images. There are a number
of images which present a higher transparency and characteristics which are harder to
segment. The use of a FFN is then really necessary, as it can separate classes of
pixels that are not linearly separable.

As another perspective of this work, we are thinking about an approach inspired
by binarization algorithms. In fact, many thresholding techniques use the
neighborhood of the pixel to determine its class. These methods can be extended to the
classification of the pixels into N classes instead of 2 classes only. Locally
adaptive thresholding techniques seem to be quite adequate for this purpose.



5 Conclusion 

Self-organizing maps are well suited for color clustering and classification of pixels
in any kind of image. By using such a map as the set of test vectors to train a
classical feed-forward network, we have described a method of transparency reduction
which gives satisfactory results in the field of ancient collections, illustrated here
by the digitized Gutenberg Bible.
Abstract. In this paper we describe the Bovary Project, a project to digitize
the manuscripts of the first great work of the famous French writer Gustave
FLAUBERT, which should end in 2006 by providing online access to
a hypertextual edition of the set of “Madame Bovary” drafts. We first describe
the global context of this project and its main objectives, and then focus
particularly on the document analysis problem. Finally we propose a new
approach for the segmentation of handwritten documents.



1 Introduction 

Libraries and museums contain collections of great interest which cannot be
shown to a large public because of their value and their state of conservation,
which prevents the diffusion of knowledge. Today, with the development
of digital technologies, it is possible to show this cultural heritage
by substituting high-quality digital reproductions for the original documents,
which allows access to the information to be shared while protecting the originals.
In recent years, numerous libraries have started digitization campaigns of their
collections. Faced with the mass of digital data produced, the development
of digital libraries allowing access to these data and the search for information
has become a major issue for the valorization of this cultural heritage. However,
such a task is difficult and expensive, and so requires prior studies of the
technical means to use. Although they are of great interest for the study and
the interpretation of literary works, modern manuscripts have been addressed by
few digitization programs because of the complexity of such documents and the
lack of suitable tools. We present in this paper the Bovary Project, a digitization
project of modern manuscripts concerning especially FLAUBERT’s manuscripts,
and we discuss the underlying principles, difficulties and technical aspects related
to such a project. In section 2 we present the global context of this project
and its objectives. We discuss in section 3 the requirements of critical
publishing and we review in section 4 the few critical editions available in
electronic format and the projects dedicated to the digitization of manuscripts.
Related problems and possible solutions are discussed in section 5,
and we focus particularly on document analysis in section 6. Finally we propose
a new approach for handwritten document segmentation in section 7.
2 General Presentation of the Project



Recently the municipal library of ROUEN began a program for digitizing its
collections. For this purpose, an efficient digitization system allowing a high
resolution display of the digitized documents has been purchased. One of the
first aims of this program is the digitization of a manuscript folder composed of
almost 5 000 original manuscripts of “Madame Bovary”, a well-known
work of the French writer Gustave FLAUBERT. This set of manuscripts
constitutes the genesis of the text, i.e. the successive drafts which highlight the
writing and rewriting processes of the author. This digitization task is still in
progress, and will be completed in a couple of months. The final aim
of this program is to provide a hypertextual edition^ allowing interactive
and free web access to this material. Such an electronic edition will be of great
interest for researchers, students, and anyone who wants to see FLAUBERT’s
manuscripts, especially because there is no critical edition of a full literary work
of this author available on the web. This project, called the “Bovary Project”, is a
multidisciplinary project which involves people from different fields:
librarians, researchers in literary sciences and researchers in computer science.
Flaubert’s drafts have a complex structure (fig. 1 left): they contain several blocks
of text not arranged in a linear way, and numerous editorial marks (erasures,
word insertions, ...). These manuscripts are therefore very hard to decipher and
interpret. Providing an electronic version of such a corpus is a challenging task
which has to respect some requirements in order to meet the users’ needs.
Fig. 1. (Left) a complex Flaubert manuscript. (Top right) a manuscript fragment
and associated diplomatic transcription (bottom right).

^ We propose a prototype at www.univ-rouen.fr/psi/BOVARY 
3 Genetic Edition Requirements

In literary sciences, the study of modern manuscripts is known as genetic
analysis. This analysis concerns the graphical aspect of the manuscripts and the
successive states of the textual content. In fact the nature of a manuscript is
dual. A manuscript can be considered as a pure graphical representation or as
a pure textual representation. A manuscript is a text with graphical interest
[1]. As modern manuscripts reflect the writing process of the author, they may
have a complicated structure and may be difficult to decipher. So in a genetic
edition, transcriptions are generally joined to the facsimile of the manuscripts.
A transcription allows an easier reading of the manuscript (fig. 1 right). One can
distinguish two transcription types: the linear one and the diplomatic one. A
linear transcription is a simple typed version of the text, which uses an adapted
coding to transcribe, in a linear way, complex editorial operations of the author
(deletion, insertion, substitution) sometimes located over one or several pages.
Diplomatic transcriptions respect, as far as possible, the physical layout of the
original manuscript, and for this reason are helpful for reading complex
drafts.

Producing a genetic folder consists in locating and dating, ordering,
deciphering, and transcribing all pre-text material. A genetic publication presents
the results of such work, that is, an ordered set of manuscripts constituting
the genesis of the text together with their associated transcriptions, and allows
the reader to browse through this set of heterogeneous data. An electronic version
of such a critical edition must provide the same functionalities as a traditional
one, but must furthermore make it easy to deal with the textual representation and
the graphical representation of the manuscripts and to create links between them,
using the capabilities of structured languages and hyperlinks.

4 Related Works 

In spite of the development of multimedia technologies and the possibilities
provided by structured languages and hypertext, few electronic versions of genetic
publications have been released up to now. This is probably due to the difficulty of
such a task and the lack of professional tools dedicated to the manipulation of
ancient or modern manuscripts. In the following we review the existing genetic
editions available in electronic format (CD-ROM or website), and we discuss the
capabilities and limitations of these editions. In a second part, we review the
latest projects related to the digitization of manuscripts.

4.1 Critical Publishing 

Among the numerous digitized literary works available in text or image mode
on Gallica, the server of the French National Library, two thematic editions
related to the genesis of Émile Zola’s work “Le Rêve” and Marcel Proust’s work
“Le Temps retrouvé” are proposed. These electronic publications make it possible to
view images of the author’s handwritten notes (in black and white TIFF format
or Adobe PDF files) and associated textual transcriptions in HTML. These
releases contain many explanatory notes about the history of these works, but
do not provide any annotation or editing capabilities. They are more pedagogic
editions than genetic ones. The critical edition of Flaubert’s work “Éducation
sentimentale” proposed by Tony Williams^ is very interesting even though only
one chapter of the work is addressed. In fact it is the only genetic publication of
a work of Flaubert available in electronic format and it shows the complexity of
Flaubert’s writing process. In spite of the designers’ wish to develop a website
intended for a large public, its use is quite difficult for those who are not experts
in literary work. The interface is not ergonomic and no advanced image visualization
and processing tools are provided. The commercial genetic edition of André Gide’s
work “Les caves du Vatican”, available on CD-ROM [3], is certainly the most
accomplished critical publication available in electronic format. It makes it
possible to view Gide’s manuscripts and associated transcriptions. Tools for image
manipulation are provided and multiple means of access to the text are possible
using different thematic tables (characters, places, keywords, ...) and a search
engine, allowing different readings of the work according to the user’s needs. As we
can see, few critical editions are available in electronic format, and they generally
concern material with less genetic complexity than Flaubert’s manuscripts or do not
deal with the full text.



4.2 Manuscripts Digitization Projects 

Considering the lack of tools suited to working on manuscripts and to
producing electronic documents from handwritten sources, several projects have
tried in recent years to define the needs of the users of digital libraries and
to propose environments devoted to the study of handwritten material. Most of
them have led to the development of workstation prototypes. We briefly present
the projects which are similar to the Bovary project and discuss the proposed
technical solutions. The problems addressed in PHILECTRE [3] are very similar
to ours. The aim of this project was to explore the techniques needed by
researchers in literary sciences, especially geneticians and medievalists. In
the context of this project, a workstation prototype dedicated to the edition
of critical material was developed. This workstation integrates document
analysis modules and interactive tools in order to help with the transcription
of manuscripts and to automatically couple this structured textual
representation with the images of the manuscripts. However, the proposed
document analysis algorithm does not work well with complex documents like
Flaubert’s drafts. Furthermore, this prototype has never been used for the
edition of a full literary work. Similar aspects have been addressed in
BAMBI [4] in the case of ancient manuscripts. The aim of this project was to
provide tools facilitating the transcription of medieval texts. However, the
problems encountered when dealing with such documents are very different from
those related to complex modern manuscripts, in particular for document
analysis. The DEBORA project [5] tried to define the needs related to the use
of digital libraries and electronic books. The aim of this project was to
develop tools for the digitization of and access to a selection of books of
the 16th century. The results of this work have provided an environment
dedicated to collaborative studies on such digitized collections. This
workstation is not designed for authorial manuscripts, but some important
technical aspects were addressed during this project, among them the problem
of image formats and compression. Once again we can notice that few projects
have addressed the problem of electronic manipulation of manuscripts, and that
they have proposed specific solutions adapted only to limited categories of
manuscripts (generally medieval texts). 

5 Related Technical Aspects 

5.1 Image Compression 

One of the first problems we have to address concerns the storage and the
format of the manuscript images. Since manuscripts are studied for their
graphical aspect, the facsimile has to be as faithful as possible to the
original, which means that the images must be of high quality. However, a high
resolution leads to large files (several megabytes) and hence to long loading
times in the case of an online edition. It is therefore necessary to compress
the images in order to reduce the files to less than 1 MB. The choice of a
suitable compression algorithm is a real problem. With the GIF and JPEG
formats commonly used on the Internet, the degradation of the images is too
severe. New algorithms like DjVu seem better suited to the compression of
document images and are increasingly used in digital libraries. They provide
a good compression ratio while preserving good image resolution, even on
handwritten documents. An improved algorithm using the same principle has been
developed in DEBORA for printed documents. 
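
To make the size argument concrete, the short sketch below computes the raw
size of a scanned page and the compression ratio needed to reach the 1 MB
target. The page dimensions and bit depth are purely hypothetical values chosen
for illustration; they are not taken from the Bovary corpus.

```python
def raw_size_bytes(width_px, height_px, bytes_per_pixel):
    """Size of an uncompressed raster image in bytes."""
    return width_px * height_px * bytes_per_pixel


def required_ratio(raw_bytes, target_bytes):
    """Compression ratio needed to bring raw_bytes down to target_bytes."""
    return raw_bytes / target_bytes


# Hypothetical scan: 2480 x 3508 pixels, 8-bit grayscale (1 byte per pixel).
raw = raw_size_bytes(2480, 3508, 1)
# Target of 1 MB, taken here as 10**6 bytes.
ratio = required_ratio(raw, 1_000_000)
print(f"raw size: {raw / 1e6:.1f} MB, required ratio: about {ratio:.1f}:1")
```

A color scan at three bytes per pixel would triple the raw size, and therefore
the required ratio, under the same assumptions.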



5.2 Indexing and Information Retrieval 

The advantage of an electronic edition is that it provides multiple access
paths to the text and allows the user to find specific information. To provide
such capabilities it is necessary to index the manuscript images. But image
indexing is a difficult task. A possible solution is to provide structured
transcriptions of the textual content of the manuscripts. It is therefore
necessary to choose a structured language suited to the encoding of critical
editions. 



5.3 Transcription Production 

The transcription of the manuscripts requires collaborative work with
researchers and Flaubert specialists. At the beginning of this project we did
not have any transcriptions of the manuscripts. This task is generally
performed manually using simple text editors. In order to facilitate this task
and to produce a structured textual representation adapted to users’
requirements, it is necessary to provide an editing environment integrating
document analysis methods and interactive tools. As projects like PHILECTRE
and BAMBI have shown, document analysis can help with the transcription and
indexing of manuscripts. Document analysis consists in extracting the
structure of a document. In the case of manuscripts, it consists in extracting
text lines and graphical elements like erasures. In spite of the progress made
in the analysis of machine-printed documents, the analysis of handwritten
documents is still a challenging task. In the following section we review the
existing methods for the analysis of handwritten sources, then discuss their
limitations and propose a new approach to deal with complex manuscripts. 

6 Handwritten Documents Analysis 

6.1 General Problem of Document Analysis 

Document analysis is a crucial step in document processing. It consists in
extracting the geometrical layout of a given document from its low-level
representation (the image). The aim is to construct a higher-level
representation: the structure of the document. This task involves locating and
separating the different homogeneous regions or objects of the document, and
determining the spatial relations between these objects. In the case of textual
documents, it means extracting the textual structure of the document in terms
of sections, paragraphs, text lines and words. The structure we aim to extract
is hierarchical and can be modeled by a tree: the elements of a document
constitute an object hierarchy. A page of handwritten text is composed of
paragraphs, which are composed of text lines, which are in turn formed of
words, and so on. 
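
As an illustration of this hierarchy, the following minimal sketch represents
a page as a tree of typed nodes. The class name `Node`, the helper functions
and the sample page are hypothetical and only serve to show the data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One physical object of the document (page, paragraph, line, word)."""
    kind: str
    children: List["Node"] = field(default_factory=list)


def depth(node):
    """Number of hierarchy levels below and including this node."""
    return 1 + max((depth(child) for child in node.children), default=0)


def count(node, kind):
    """Number of nodes of the given kind in the subtree rooted at node."""
    return int(node.kind == kind) + sum(count(c, kind) for c in node.children)


# Hypothetical page: one paragraph, two text lines of three and two words.
line_a = Node("textLine", [Node("word") for _ in range(3)])
line_b = Node("textLine", [Node("word") for _ in range(2)])
page = Node("page", [Node("paragraph", [line_a, line_b])])

print(depth(page), count(page, "textLine"), count(page, "word"))
```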

Two strategies are possible to extract the geometrical structure. The
“bottom-up” strategy iteratively groups objects using local information,
starting from well-segmented primitives of the document such as connected
components, in order to reconstruct higher-level objects. In other words, it
tries to reconstruct the tree representing the structure of the document from
the leaves (bottom) to the root (up). The “top-down” strategy, on the contrary,
starts from the root representing the entire document and recursively tries to
develop each branch of the tree. A third possible strategy combines bottom-up
and top-down analysis. Numerous methods using one of these strategies have
been proposed for the analysis of machine-printed documents. Among the most
popular are Kise’s method [13] based on the area Voronoi diagram, O’Gorman’s
Docstrum method [14] based on neighbor clustering, and Nagy’s X-Y cut [15]
based on the analysis of projection profiles. These methods give good results
on printed documents but are not directly applicable to handwritten documents,
because they generally take into account only global features of the page and
are thus dedicated to well-structured documents. Unlike printed documents,
handwritten documents have a local structure which presents considerable
variability: fluctuating or skewed text lines, overlapping words, unaligned
paragraphs, etc. To cope with this local variability, the methods proposed in
the literature for the segmentation of handwritten documents are generally
“bottom-up” and based on local analysis. 



6.2 State of the Art 

The few methods proposed in the literature focus on the particular case of
text line extraction and are all based on a bottom-up strategy. In [9], an
approach based on perceptual grouping of connected components of black pixels
is proposed. First, the connected components of the document image are
extracted. The elongated components which have a reliable direction are chosen
as starting-point candidates for possible alignments. Text lines are then
iteratively constructed by grouping neighboring connected components according
to perceptual criteria inspired by Gestalt theory, such as similarity,
continuity and proximity. Local neighborhood constraints are thus combined with
global quality measures. As conflicts can appear during the grouping process,
the method integrates a refinement procedure combining a global and a local
analysis of the conflicts. This procedure consists in applying a set of rules
which consider the configuration of the alignments and their quality. A text
line extraction module using this method has been integrated in the reading
and editing environment for scholarly research on literary works developed
during the PHILECTRE project [10]. According to the author, the proposed
method cannot be applied to degraded or poorly structured documents, such as
modern authorial manuscripts. 

A method based on a shortest spanning tree search is presented in [7]. The
method builds a graph of the main strokes of the document image and searches
for the shortest spanning tree of this graph. The main idea is to assume that
for each text line it is possible to find a minimum-distance path that extends
from one end of the line to the other, that is to say a shortest spanning tree
that spans the main strokes of the considered text line. This method assumes
that the distance between the words in a text line is less than the distance
between two adjacent text lines. It has been applied to Arabic text. 
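
The spacing assumption can be illustrated with a loose sketch: build a minimum
spanning tree over stroke positions and cut the edges that are too long to lie
within a line. This is not the algorithm of [7]; the edge-cutting step, the
`max_gap` threshold and the sample coordinates are assumptions made for
illustration only.

```python
import math


def euclid(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length)."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        length, i, j = min(
            (euclid(points[a], points[b]), a, b)
            for a in in_tree
            for b in range(len(points))
            if b not in in_tree
        )
        edges.append((i, j, length))
        in_tree.add(j)
    return edges


def split_into_lines(points, max_gap):
    """Group points connected by spanning-tree edges no longer than max_gap."""
    parent = list(range(len(points)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j, length in minimum_spanning_tree(points):
        if length <= max_gap:
            parent[find(i)] = find(j)
    groups = {}
    for k in range(len(points)):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())


# Hypothetical stroke positions: two horizontal lines, 40 pixels apart.
strokes = [(0, 0), (10, 0), (20, 0), (0, 40), (10, 40), (20, 40)]
print(split_into_lines(strokes, max_gap=25))
```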

In [8] an iterative hypothesis-validation strategy based on the Hough
transform is proposed. The skew orientation of handwritten text lines is
obtained by applying the Hough transform to the gravity center of each
connected component of the document image. This generates several text line
hypotheses. A validation procedure in the image domain is then performed to
eliminate erroneous alignments among connected components, using contextual
information such as proximity and direction continuity criteria. According to
the authors, this method is able to detect text lines in handwritten documents
which may contain lines oriented in several directions, erasures and
annotations between main lines. 
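
The voting step of such a strategy can be sketched as follows. Only the Hough
accumulation over component centers is shown; the validation stage of [8] is
not reproduced, and the one-degree angular step, the `rho_step` parameter and
the sample centroids are assumptions.

```python
import math
from collections import Counter


def hough_votes(centroids, rho_step=1.0):
    """Accumulate votes in (theta in degrees, quantized rho) cells.

    Each point (x, y) votes for every line rho = x*cos(theta) + y*sin(theta)
    passing through it, with theta the angle of the line normal.
    """
    votes = Counter()
    for x, y in centroids:
        for theta in range(180):
            t = math.radians(theta)
            rho = x * math.cos(t) + y * math.sin(t)
            votes[(theta, round(rho / rho_step))] += 1
    return votes


# Hypothetical centroids: two horizontal text lines at y = 0 and y = 20.
centroids = [(x, y) for y in (0, 20) for x in (0, 10, 20, 30)]
votes = hough_votes(centroids)
top = max(votes.values())
hypotheses = sorted(cell for cell, n in votes.items() if n == top)
# Each hypothesis is (normal angle, rho); the line direction is theta - 90.
print(top, hypotheses)
```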

An algorithm based on the analysis of horizontal run projections and on
connected component grouping and splitting procedures is presented in [6].
First the image is partitioned into vertical strips, and then the run
projections are analyzed on each strip. This method can deal with fluctuating
or skewed text lines and preserves the punctuation. 

A method for line detection and segmentation in historical church registers
is proposed in [12]. This method is based on local minima detection of
connected components. It is applied to a chain code representation of the
connected components. The idea is to gradually construct line segments until a
unique text line is formed. This algorithm is able to segment text lines that
are close to each other, touching text lines and fluctuating text lines. 

The main problem of these methods is that they generally take local deci- 
sions during the grouping process, and they sometimes fail to find the “best” 
segmentation when dealing with complex documents like modern manuscripts. 
Furthermore these methods don’t use prior knowledge or don’t express it explic- 
itly, making an adaptation to different classes of documents difficult. To avoid 
these drawbacks, we propose to adapt the techniques used in the domain of 
structured document recognition, to formalize prior syntactical knowledge [16], 
and we propose a segmentation strategy based on the principles of Dynamic 
Programming. 

7 A Dynamic Programming Approach for the Segmentation of Handwritten Documents 

We propose a new approach based on dynamic programming principles for the
segmentation of complex or poorly structured offline handwritten documents,
such as cultural heritage manuscripts. The main idea of our method is to use
prior knowledge formalized as layout rules, together with an adapted search
strategy which takes contextual information into account. The segmentation is
viewed as a bottom-up grouping process directed by the search strategy. We use
a state-tree formalism to model the bottom-up grouping process. Each state of
the tree represents a partial segmentation of the document. The aim of the
search strategy is to guide the grouping process in order to find the best
partition of the document into physical objects. 

7.1 Modeling of Prior Knowledge 

The prior knowledge concerning the structure of the document is modeled using
a grammar, that is, a set of layout rules. The terms of this grammar are
hierarchical and correspond to the physical objects of the document: a
paragraph is composed of text lines, text lines are composed of connected
components, and so on. The rules of the grammar describe the grouping of
objects into a hierarchically higher object. For example, text lines are
produced and extended by grouping connected components. The rules have the
following forms: 

- $A \rightarrow B$ 

- $C \rightarrow D\,E$ or $C \rightarrow E\,D$ 

where $B$ and $E$ are components of $A$ and $C$ respectively, and $D$ is a
term of the same hierarchical level as $C$. 

For example, the following rule corresponds to adding a connected component
to an existing text line: 

- textLine $\rightarrow$ textLine + connectedComponent    (1) 

When applied, this rule produces a new instance of the considered text line.
The symbol “+” is a spatial adjacency operator. 

A cost is associated with each rule of the grammar. This cost corresponds to
the probability that the rule is applied. In order to take contextual
information into account during the segmentation, each term of the grammar is
described by a feature vector. The probability of a rule depends on the
features of the terms involved in the rule. For each rule we assume that the
probability density function can be estimated by a normal distribution. Given
a rule $r_i$ and a feature vector $X$ describing the terms involved, the
probability of the rule is given by: 

\[ P_{r_i}(X) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}\,(X - m_i)^{T}\,\Sigma_i^{-1}\,(X - m_i)\right) \]

where $m_i$ is the mean feature vector and $\Sigma_i$ the covariance matrix
for the rule $r_i$, and $d$ is the dimension of the feature vector. These
parameters can be estimated on a training set of documents. 

For example, if we consider rule (1), the cost depends on the features of the
considered text line and on the features of the connected component to be
added to it. Possible features are the distance between the connected
component and the text line, relative to the average distance between
components in the text line, or the skew of the connected component, relative
to the average skew of the components in the text line. 
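
A minimal sketch of such a rule probability is given below, assuming a normal
density over a two-dimensional feature vector (for instance a relative distance
and a relative skew). The function name and the parameter values are
hypothetical; in practice the mean and covariance would be estimated from
training documents.

```python
import math


def rule_probability(x, mean, cov):
    """Bivariate normal density evaluated at the feature vector x.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - m)^T cov^{-1} (x - m), using the closed-form
    # inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


# Hypothetical parameters for the rule "add a component to a text line".
mean = (0.0, 0.0)
cov = ((1.0, 0.0), (0.0, 1.0))
print(rule_probability((0.0, 0.0), mean, cov))  # density at the mean
print(rule_probability((2.0, 1.0), mean, cov))  # lower density further away
```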

7.2 Search Strategy 

We have explained how to model the prior knowledge about the structure of the
document; we now have to determine how to parse the document, that is, in
which order to parse the data (document objects) and which rule to apply. The
aim is to find the best segmentation, that is, the best parse tree of the
document given the grammar and the data. This problem can be solved using
pathfinding techniques: it is equivalent to searching for the lowest-cost path
from the root to a goal node in a state tree. This requires defining the
initial state and the final state of the problem. We assume that the root of
the state tree corresponds to the initial data of the problem, that is, the
set of data to parse and a set of rules. A goal node represents a possible
solution of the problem, that is, a possible segmentation of the document, and
an intermediate node of the state tree represents a partial segmentation. A
transition between two states corresponds to the application of a rewriting
rule. The transition cost is the probability of the applied rule. The
probability of a path is obtained by multiplying the probabilities of the
rules applied along this path. What we want to find is the most likely path
leading to a goal. This is an optimal pathfinding problem. To solve it, we use
a Branch-and-Bound procedure as the search strategy. This method is guaranteed
to find the optimal path. At each step of the search, the best path developed
so far is extended first. All the valid extensions (rewriting rule
applications) of the path are generated and stored. This requires knowing
exactly which rules can be applied and to which data. However, when parsing a
two-dimensional structure, the next data to be analyzed is not known.
Theoretically, we must explore all the solutions by applying the rules to all
the data not yet analyzed. To avoid a combinatorial explosion, it is necessary
to prune some solutions. For this purpose we use a neighborhood graph, which
at each step selects the data to be considered. This graph represents the
neighborhood relations between the objects of the document. 
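
The search can be sketched as a best-first expansion in which the cost of a
path is the sum of the negative log-probabilities of the rules applied, so
that the lowest-cost path is the most likely one. The sketch below is only an
illustration under stated assumptions: the probability model `p_add`, its
parameters, the component positions and the neighborhood graph are all
hypothetical, and a single text line is grown from a seed component.

```python
import heapq
import math
from itertools import count


def p_add(line, cand, pos, sigma=5.0, eps=1e-3):
    """Hypothetical probability of adding component cand to the line,
    based on its vertical offset from the line's mean height."""
    mean_y = sum(pos[c][1] for c in line) / len(line)
    p = math.exp(-((pos[cand][1] - mean_y) ** 2) / (2 * sigma ** 2))
    return min(max(p, eps), 1 - eps)


def extract_line(seed, pos, neighbors):
    """Best-first search over partial segmentations.

    A state is (components in the line, rejected components). The
    neighborhood graph restricts which component is examined next.
    Costs are non-negative, so the first goal popped is optimal.
    """
    tie = count()  # tie-breaker so the heap never compares sets
    heap = [(0.0, next(tie), frozenset([seed]), frozenset())]
    while heap:
        cost, _, line, rejected = heapq.heappop(heap)
        frontier = sorted({n for c in line for n in neighbors[c]} - line - rejected)
        if not frontier:  # goal: no more data to analyze
            return set(line), math.exp(-cost)
        cand = frontier[0]
        p = p_add(line, cand, pos)
        # Rule "add": the component joins the line.
        heapq.heappush(heap, (cost - math.log(p), next(tie), line | {cand}, rejected))
        # Negated rule: the component is left out.
        heapq.heappush(heap, (cost - math.log(1 - p), next(tie), line, rejected | {cand}))
    return None


# Hypothetical data: two rows of three components, 30 pixels apart.
pos = {0: (0, 0), 1: (10, 0), 2: (20, 0), 3: (0, 30), 4: (10, 30), 5: (20, 30)}
neighbors = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5], 3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
line, probability = extract_line(0, pos, neighbors)
print(sorted(line), probability)
```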

7.3 Segmentation Example 

To illustrate the application of the approach we consider the following
example. Given a manuscript image fragment containing two skewed text lines
(Fig. 2), we want to isolate one of them. For this problem, the grammar we
consider is quite simple and consists of one rule and its contrary: 

- $R_1$: textLine $\rightarrow$ textLine + connectedComponent 

- $\overline{R_1}$: textLine $\rightarrow$ textLine connectedComponent 

The rule $R_1$ represents the addition of the current connected component to
the text line, and $\overline{R_1}$ is the negation of this rule. 

Fig. 3 shows the neighbor graph which is used to select the data during the
parsing. This graph is updated each time a rule is applied. A goal is found
when there is no more data to be analyzed. 

Finally, Fig. 4 shows the successive instantiations of the rule $R_1$ and
the resulting extracted text line. 







Fig. 2. Two text lines of Flaubert’s manuscript. 

Fig. 3. Neighbor graph. 

Fig. 4. Successive instantiation of rule $R_1$. 



8 Conclusion and Future Works 

We have presented the Bovary Project, a cultural heritage manuscript
digitization project, and discussed the related requirements and technical
aspects. As we have seen, document analysis can help with the transcription of
modern manuscripts, and allows a text-image coupling between the structured
textual representation, which is required for computerized processing, and the
image representation, which conveys the graphical aspect of the manuscript.
However, no existing document analysis method is able to deal with such
complex and weakly structured documents. We have proposed a new approach for
the segmentation of handwritten documents. This approach combines local and
global analysis by integrating prior knowledge about the structure to be
segmented and local contextual features. In order to cope with ambiguities,
the proposed formalism is probabilistic. The parsing algorithm makes use of
the pathfinding principles of graph theory. First results on simple test
problems show the feasibility of the method. Future work will consist in
improving the method by considering many more contextual features and using
learning techniques to determine the probabilities of the rewriting rules. 



Word Grouping in Document Images Based on Voronoi Tessellation 

Abstract. Voronoi tessellation of image elements provides an intuitive 
and appealing definition of proximity, which has been suggested as an 
effective tool for the description of relations among the neighboring 
objects in a digital image. In this paper, a Voronoi tessellation based 
method is presented for word grouping in document images. The Voronoi 
neighborhoods are generated from the Voronoi tessellation, with the 
information about the relations and distances of neighboring connected 
components, based on which word grouping is carried out. The proposed 
method has been evaluated on a variety of document images. The 
experimental results show that it has achieved promising results with 
a high accuracy, and is robust to various font types, styles, sizes, skew 
angles, as well as different text orientations. 



1 Introduction 

Segmenting word objects from document images is an essential component for 
most document analysis and recognition systems. Its goal is to group a set of 
image elements of characters, touched characters, or character portions into a 
word. The accuracy of the word grouping greatly influences the performance of 
the systems. If word grouping is incorrectly done, serious irreparable errors could 
occur in the subsequent processing. This is especially true for most word-based 
recognition systems and word spotting systems. However, it is not trivial to 
develop a word segmentation method with not only high accuracy but also 
robustness to various documents. 

Many methods for word grouping in document images have been proposed.
Fletcher and Kasturi [1] described a method in which the Hough transform is
used to group characters into words. The Hough transform is applied to the
centroids of the rectangles enclosing each connected component, which locates
the collinear connected components. The positional relationships between the
collinear connected components are then examined to locate the words. This
method, however, cannot deal with documents containing characters of various
sizes, because the process depends on the average height of all connected
components. Jain and Bhattacharjee [2] considered text images as textured
objects and used Gabor filtering for text analysis. This method is sensitive
to font sizes and styles, and it is generally time-consuming. Ittner and
Baird [3] assumed that the distribution of scaled inter-symbol distances
parallel to the text-line orientation is bimodal, with one mode for
inter-symbol spaces and the other for inter-word spaces. Wang et al. [4]
presented a statistical approach to text word extraction that takes a set of
bounding boxes of glyphs and their associated text lines of a given document
and partitions the glyphs into a set of text words. The probabilities,
estimated off-line from a document image database, drive all decisions in the
on-line text word extraction. An accuracy of about 97% was reported. In [5],
Park introduced a 3D neighborhood graph model which can group words in
inclined lines, intersecting lines, and even curved lines. Sobottka [6]
proposed an approach to automatically extract text from colored book and
journal covers. Tan and Ng [7] gave a method using an irregular pyramid
structure. The uniqueness of this algorithm is its inclusion of strategic
background information in the analysis. 

A crucial step in word grouping is to find the neighbors of a particular
image element. A naive approach is to compare the distances from the element
to all the others in the image and to define the neighbors as those at the
shortest distances. Such a definition is not always accurate, because an
element at a short distance is sometimes not a real neighbor. Voronoi
tessellation (also known as the Voronoi diagram) provides a useful tool for
generating a minimal yet complete set of neighbors of an element, i.e. only
those elements that are closest are obtained, but all of them are included.
The Voronoi tessellation of a collection of geometric objects is a partition
of space into cells, each of which consists of all the points closer to one
particular object than to any other. It divides the continuous space into
mutually disjoint subspaces according to the nearest neighbor rule. In the
past decades, increasing attention has been paid to the use of Voronoi
tessellation for various applications. 
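
The nearest neighbor rule can be illustrated with a brute-force sketch on a
small pixel grid: every cell is labeled with its nearest site, and two sites
are neighbors when their cells touch. The function names, the grid size and
the three point sites are assumptions made for this example; an efficient
implementation would not enumerate all pixels.

```python
def voronoi_labels(sites, width, height):
    """Label each grid cell with the index of its nearest site.

    Ties are given to the site with the lowest index.
    """
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            dists = [(sx - x) ** 2 + (sy - y) ** 2 for sx, sy in sites]
            row.append(dists.index(min(dists)))
        labels.append(row)
    return labels


def voronoi_neighbors(labels):
    """Pairs of sites whose cells are horizontally or vertically adjacent."""
    pairs = set()
    height, width = len(labels), len(labels[0])
    for y in range(height):
        for x in range(width):
            a = labels[y][x]
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height and labels[ny][nx] != a:
                    b = labels[ny][nx]
                    pairs.add((min(a, b), max(a, b)))
    return pairs


# Hypothetical sites: three points on one row of a 21 x 11 grid.
sites = [(2, 5), (10, 5), (18, 5)]
labels = voronoi_labels(sites, width=21, height=11)
print(sorted(voronoi_neighbors(labels)))
```

In this example the two outer sites are not reported as neighbors, because the
cell of the middle site separates them.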

The most significant contribution of the Voronoi tessellation to image
analysis is that it introduces neighboring relations into a set of elements
(e.g. connected components) on a digital image. In particular, it enables us
to obtain neighbors without recourse to predetermined parameters. In recent
years, there have been some reports in the literature on applying the Voronoi
diagram to document image analysis. For instance, Ittner and Baird [3] applied
the Delaunay triangulation, the dual of the Voronoi diagram, to detect the
orientation of text lines in a block, based on the assumption that most
Delaunay edges lie within rather than between text lines. Xiao and Yan [8]
described a method of text region extraction using the Delaunay tessellation.
In both of these methods, the connected components in a document image are
represented by their centroids. Such a simplification is inappropriate in some
cases, because the centroid is a poor representation of shape for non-round
elements. Since a document image generally contains characters of different
sizes and different intercharacter gaps, the approximation of each element as
a single point is too imprecise, and it does not adequately represent the
spatial structure of the page image. Therefore, the point Voronoi tessellation
is unsuitable for some applications. 

Considering the complex shapes of image elements, the use of the area Voronoi
diagram has been investigated for document image analysis. For example, Wang
et al. [9] applied area Voronoi tessellation to segment characters connected
to graphics, based on the observation that area Voronoi tessellation
represents the shape of connected components better than the bounding box
does. Kise et al. employed the area Voronoi diagram to perform page
segmentation [10] and text-line extraction [11]. 

Word grouping would evidently benefit from the information provided by the
Voronoi tessellation. However, this topic has not been studied extensively so
far, except for the work of Burge and Monagan [12], who attempted to use the
Voronoi tessellation for grouping words and multi-part symbols in a map
understanding system. An obvious shortcoming of their method is that it
requires knowledge of the resolution at which the processed image was scanned.
In most cases, the image resolution is unknown to a document analysis system. 

Based on the area Voronoi tessellation, a method for grouping the image
elements into word objects is proposed in this paper. No a priori knowledge
such as character font, character size or intercharacter spacing is required
by the proposed method, and no special word orientations are assumed.
Experimental results on real document images show that more than 99% of words
are successfully extracted. 

2 Word Grouping Based on Voronoi Neighborhoods Analysis 

We suppose that the area Voronoi tessellation of the processed document image
has been obtained. For details on how to construct the Voronoi tessellation of
a digital image, readers can refer to [13]. A test image, shown in Fig. 1, with
skewed text of different font styles and sizes, is used to demonstrate the
performance of the proposed word grouping method based on the analysis of
Voronoi neighborhoods. As shown in Fig. 1(a), Voronoi edges lie between any
two adjacent connected components (elements). In other words, every word is
represented as a set of Voronoi regions that are adjacent to one another. If
two elements $e_i$ and $e_j$ share part of their Voronoi edges, they are said
to be Voronoi neighbors of each other. Fig. 1(b) shows that the Delaunay
triangulation, the dual of the Voronoi tessellation, gives the relations among
Voronoi neighbors. Each edge of the triangles connects two Voronoi neighbors.
As a result, a neighborhood graph can be constructed from the area Voronoi
tessellation. In the neighborhood graph, each node represents an image
element, and each edge is a connection to a neighboring element. The distance
between two Voronoi neighbors is treated as the weight of the edge connecting
them. Such an edge is represented using a 3-tuple: 

\[ (e_i,\ e_j,\ d_{ij}) \]

where $e_i$ and $e_j$ are two Voronoi neighbors, and the distance $d_{ij}$ is
defined as follows. Let $L_{ij} = \{l_1, \ldots, l_m\}$ be the Voronoi edge
between the two elements $e_i$ and $e_j$, where $l_k$ is a point on $L_{ij}$.
Note that $d(l_k, e_i) = d(l_k, e_j)$ according to the definition of a Voronoi
edge, but in digitized space $d(l_k, e_i) = d(l_k, e_j) \pm 1$ is also
possible. We define $d_{ij}$ as the minimum, over the points of the Voronoi
edge $L_{ij}$, of the sum of the distances to the elements $e_i$ and $e_j$,
viz. 

\[ d_{ij} = \min_{1 \le k \le m} \bigl( d(l_k, e_i) + d(l_k, e_j) \bigr) \tag{1} \]

Fig. 1. Voronoi tessellation of a document image. 

Table 1. Neighbor pairs and their distances. 

Table 1 lists the distances of some neighbor pairs generated from Fig. 1. The
goal of grouping elements into words then becomes a search of the
neighborhood graph for subgraphs, such that connections among elements from
different words are deleted while connections among elements belonging to the
same word remain. The process of word grouping is therefore considered to be
the selection of the edges in the neighborhood graph which connect two
elements potentially in the same word. To this end, we need criteria for
deciding which connecting lines should be deleted and which should remain. 
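
The construction of the weighted neighborhood graph can be sketched by brute
force on a small grid. In this sketch each pixel is assigned to its nearest
element, boundary pixels between two regions stand in for the points of the
Voronoi edge, and the edge weight is the smallest sum of the distances to the
two elements. The pixel-level approximation, the function names and the three
sample elements are assumptions; [13] should be consulted for an actual
construction algorithm.

```python
import math


def dist_to_element(p, element):
    """Euclidean distance from pixel p to the nearest pixel of an element."""
    return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in element)


def neighborhood_graph(elements, width, height):
    """Return {(i, j): d_ij} for every pair of area-Voronoi neighbors."""
    dist = {
        (x, y): [dist_to_element((x, y), e) for e in elements]
        for y in range(height)
        for x in range(width)
    }
    # Region label of a pixel: index of its nearest element (ties -> lowest).
    region = {p: d.index(min(d)) for p, d in dist.items()}
    graph = {}
    for (x, y), i in region.items():
        for q in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            j = region.get(q)
            if j is None or j == i:
                continue
            # (x, y) lies on the boundary between regions i and j.
            s = dist[(x, y)][i] + dist[(x, y)][j]
            key = (min(i, j), max(i, j))
            graph[key] = min(s, graph.get(key, math.inf))
    return graph


# Hypothetical elements: three vertical bars at x = 0, x = 4 and x = 12.
elements = [[(x0, y) for y in range(3)] for x0 in (0, 4, 12)]
graph = neighborhood_graph(elements, width=13, height=3)
print(graph)
```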

Fig. 1 and Table 1 show that an element generally has more than three
neighbors. In most cases, however, only one or two of these neighbors lie
within the same word as the element. When two neighbors belong to the same
word, for example, one is the preceding character and the other is the
succeeding character. Therefore, we take into account only the two nearest
neighbors: the one at the shortest distance from the element and the one at
the second shortest distance. An example is shown in Fig. 2, where only the
two nearest neighbors are selected and the others are excluded from further
processing, i.e. in the subsequent process we need only consider whether
the two nearest neighbors should be grouped with the element. With this
step, most of the connections between text lines are effectively
eliminated.
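The selection of the two nearest neighbors can be sketched as follows. This is only an illustration: the dictionary layout and the example distances are hypothetical and are not the values of Table 1.

```python
def two_nearest_neighbors(distances):
    """distances maps (i, j) Voronoi neighbor pairs to d_ij.
    Returns, for each element, its (up to) two nearest neighbors,
    nearest first."""
    by_element = {}
    for (i, j), d in distances.items():
        by_element.setdefault(i, []).append((d, j))
        by_element.setdefault(j, []).append((d, i))
    return {k: [n for _, n in sorted(v)[:2]] for k, v in by_element.items()}

# Hypothetical neighbor distances for four elements.
dist = {(1, 2): 5, (1, 3): 40, (1, 4): 7, (2, 3): 6, (3, 4): 50}
nearest = two_nearest_neighbors(dist)
print(nearest[1])  # [2, 4]: the far neighbor 3 is dropped
```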

Fig. 2. Two selected nearest neighbors with the shortest distances.

Fig. 3. Occurrence frequency vs. $f_1$.

The remaining task is to decide whether a surviving connection between two
elements lies within a word or between words. To solve this problem, we employ
four features defined as follows. Suppose the element $e_k$ has two nearest
neighbors: $e_f$ at the shortest distance $d_{kf}$, and $e_s$ at the second
shortest distance $d_{ks}$. The heights, widths and areas of $e_k$, $e_f$ and
$e_s$ are $(h_k, w_k, a_k)$, $(h_f, w_f, a_f)$ and $(h_s, w_s, a_s)$,
respectively. The four features are defined as:

$$f_1 = \frac{d_{kf}}{\min\left((h_k + w_k)/2,\ (h_f + w_f)/2\right)} \qquad (2)$$

$$f_2 = \frac{d_{ks}}{\min\left((h_k + w_k)/2,\ (h_s + w_s)/2\right)} \qquad (3)$$

$$f_3 = \frac{d_{ks} - d_{kf}}{d_{ks}} \qquad (4)$$

$$f_4 = \frac{a_k}{a_f} \qquad (5)$$
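A minimal sketch of the four features follows, assuming Eqs. (2)-(5) read as two normalized distances, a normalized distance difference and an area ratio. The `Element` class and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Element:
    height: float
    width: float
    area: float

def features(e_k, e_f, e_s, d_kf, d_ks):
    """Return (f1, f2, f3, f4) for element e_k and its two nearest neighbors."""
    def size(e):
        return (e.height + e.width) / 2          # mean of height and width
    f1 = d_kf / min(size(e_k), size(e_f))        # normalized nearest distance
    f2 = d_ks / min(size(e_k), size(e_s))        # normalized second-nearest distance
    f3 = (d_ks - d_kf) / d_ks                    # normalized distance difference
    f4 = e_k.area / e_f.area                     # area ratio
    return f1, f2, f3, f4

# Hypothetical measurements in pixels.
e_k = Element(height=10, width=10, area=50)
e_f = Element(height=20, width=10, area=100)
e_s = Element(height=10, width=6, area=40)
print(features(e_k, e_f, e_s, d_kf=2, d_ks=4))  # (0.2, 0.5, 0.5, 0.5)
```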



To investigate the features $f_1$, $f_2$ and $f_3$, 100 real document images
with varying character fonts and sizes and different inter-character, inter-word
and inter-textline gaps are used to obtain their statistical characteristics.
The occurrence frequencies are shown in Figs. 3-5, respectively. Feature $f_1$
is the normalized distance from $e_k$ to its nearest neighbor $e_f$. Fig. 3
shows that the majority of the pairs $(e_k, e_f)$ are within the same word, but
exceptions do exist. One exception is found in single-character words such as
the indefinite article "a". Another is found in elements with small areas,
such as the dots on 'i' and 'j' and punctuation marks. In the former case, the
"a" need not be grouped with other elements. The latter case is discussed
later. Therefore, we can select $T_{f_1}$, as shown in Fig. 3, as a threshold.
If $f_1 < T_{f_1}$, then $e_k$ and $e_f$ should be grouped.

Fig. 4. Occurrence frequency vs. $f_2$.

Feature $f_2$ is the normalized distance from $e_k$ to its second nearest
neighbor $e_s$. Figure 4 shows that most of the pairs $(e_k, e_s)$ are within
the same word, but some pairs come from different words. For example, the last
character of a word and the first character of the succeeding word generate
such a pair. The decision therefore cannot be made using feature $f_2$ alone.

Feature $f_3$ is the normalized difference between the distances $d_{ks}$ and
$d_{kf}$, as given in Fig. 5. If an element is located in the middle of a word
(i.e. it is neither the first nor the last character), the value of its $f_3$
feature is small. Otherwise, the value is large. We combine the features $f_2$
and $f_3$ to produce another criterion. If $f_2$ is less than $T_{f_2}$ and
$f_3$ is less than $T_{f_3}$, then $e_k$ and $e_s$ are grouped, and of course
$e_k$ and $e_f$ are grouped as well, where $T_{f_2}$ and $T_{f_3}$ are two
thresholds as shown in Figs. 4 and 5, respectively.

The two criteria can be summarized as follows:

Rule 1: if $f_1 < T_{f_1}$, then $e_k$ and $e_f$ are grouped.

Rule 2: if $f_2 < T_{f_2}$ and $f_3 < T_{f_3}$, then $e_k$, $e_f$ and $e_s$ are grouped.
After the above process, two problems remain. One is that many dots of the
characters 'i' and 'j' are not grouped with the corresponding words. The
other is that some punctuation marks are erroneously grouped with words. From
the properties of the elements alone, it is difficult to distinguish the dots
on the characters 'i' and 'j' from punctuation such as commas and full stops.
However, they share the characteristic of a relatively small area. We
therefore employ feature $f_4$, the area ratio, to separate them from other
elements. If $f_4$ of an element is less than a predefined threshold $T_{f_4}$
(say 0.25, chosen empirically), the element undergoes further processing.

Our observations show that the two nearest neighbors of the dots on 'i' and
'j' generally come from the same word. For a punctuation element, on the
other hand, the nearest neighbor is generally the last character of the
preceding word, whereas the second nearest neighbor is the first character of
the succeeding word. As a result, its $f_3$ normally has a larger value. We
then use the following criteria to tell them apart:

Rule 3: if $f_3 < T_{f_3}$ and $f_4 < T_{f_4}$, then $e_k$ and $e_f$ are grouped.

Rule 4: if $f_2 > T_{f_2}$, $f_3 > T_{f_3}$ and $f_4 < T_{f_4}$, then $e_k$ cannot be
grouped with $e_f$.
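The four rules can be combined into one decision function, sketched below. The text does not state how the rules are ordered when several apply, so the precedence used here (small elements are checked against Rules 4 and 3 first, then Rules 1 and 2) is an assumption. Only $T_{f_4} = 0.25$ comes from the text; the other threshold values are hypothetical.

```python
def decide_grouping(f1, f2, f3, f4, t1, t2, t3, t4=0.25):
    """Return (group_with_nearest, group_with_second_nearest) for element e_k.
    Rule precedence is an assumption of this sketch."""
    if f4 < t4:                       # small element: dot or punctuation candidate
        if f2 > t2 and f3 > t3:       # Rule 4: punctuation, keep it apart
            return False, False
        if f3 < t3:                   # Rule 3: dot of 'i' or 'j'
            return True, False
    group_both = f2 < t2 and f3 < t3  # Rule 2
    group_first = f1 < t1             # Rule 1
    return group_first or group_both, group_both

# Hypothetical thresholds read off the histograms.
T = dict(t1=0.3, t2=0.6, t3=0.3)
print(decide_grouping(0.2, 0.5, 0.1, 1.0, **T))  # mid-word character: (True, True)
print(decide_grouping(0.2, 0.9, 0.7, 0.1, **T))  # full stop: (False, False)
print(decide_grouping(0.1, 0.3, 0.2, 0.1, **T))  # dot on 'i': (True, False)
```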

Commas and full stops are the most commonly used punctuation marks in
documents, and Rule 4 can effectively detect them. It is also worth noting
that some special symbols, such as the dash ('-'), the tilde ('~') and various
kinds of parentheses, should be detected and excluded from word grouping. For
this purpose, Kim's method [14] can be applied as post-processing.

Figure 6(a) shows the processing results of Fig. 2, and Fig. 6(b) gives the 
corresponding Voronoi edges separating the grouped words. 

3 Experimental Results

To evaluate the performance of the proposed method, document images from
different sources, with characters of various sizes and fonts, are used for
the test. They comprise 100 page images selected from the UW document image
database, and images provided by the Digital Library of our university,
including 328 document images of scanned books and 476 document images of
scanned outdated student theses.

Figure 7 shows an example of word grouping carried out on an image of a
scanned book, in which the segmented word entities are bounded by rectangular
boxes. It shows that the proposed method can deal with words of different
text orientations. The method is also insensitive to character sizes and
fonts.

The word grouping performance on the test documents is tabulated in Table
2, including the rates of correct grouping, splitting (fragmentation) and
over-merging. An encouraging accuracy of over 99% has been achieved. The
errors of word extraction can be divided into two categories. One is
fragmentation or splitting, in which one word is erroneously divided into two
or more words. The other is over-merging, in which two or more words are
merged into one word. Some examples of failed word bounding by the present
algorithm are illustrated in Fig. 8.

Fig. 6. Word grouping results of Fig. 1.

4 Conclusions 

Voronoi tessellation is an effective tool for representing the neighboring relations
among elements in a digital image. A Voronoi neighborhood based algorithm
for grouping word objects in document images is presented in this paper. The
Voronoi neighborhoods are generated from the Voronoi tessellation for word
grouping, with the information about the relations and distances of neighboring
connected components. The experimental results on various document images
show that the proposed approach achieves high accuracy and is robust to
font types, styles, sizes and skew angles, as well as text orientations.

Fig. 7. An example of word grouping.

Table 2. Performance of word grouping.

Accuracy (%)       98.83   99.06   99.26   99.05
Splitting (%)       0.56    0.28    0.43    0.42
Over-merging (%)    0.61    0.66    0.31    0.53

Fig. 8. Examples of failed word bounding.
Abstract. Region-of-Interest (ROI) techniques are often utilized in natural still-image
coding standards such as JPEG2000 [1]. In contrast, document image
coding typically adopts multi-layer methods [2], using a carefully selected
algorithm for each layer to optimize overall performance. In this paper, an ROI-based
method is proposed for multi-component document image coding, where
rectangular textual ROIs are easily extracted using standard document image analysis
techniques. Compared to multi-layer methods, the method is simpler and
scalable, while preserving comparable visual quality at equivalent PSNR.



1 Introduction 

Historically, the most common format for document images has been binary for reasons
of efficient storage, leading to the development of binary document image coding
standards such as JBIG1 and JBIG2 [3], the latter a token-based compressor. However,
as demand for higher image quality has grown and the range of digitized documents
increased, gray-scale and color document image representations have become common,
although these increase storage space and/or transmission time. Hence, it is now
essential to design document image coding algorithms that can compactly represent
multi-component document images.

The primary content of multi-component document images typically consists of 
text, lines and drawings, and high resolution is always required to display this fore- 
ground information. The residual can be regarded as background, and for transmission 
purposes is a less important region, which can be displayed later and at lower resolu- 
tion than the foreground, as is also considered acceptable in [4]. The majority of color 
document image compression algorithms use segmentation-based multi-layer methods 
to meet these requirements. One example, DjVu [2], defines three layers: mask layer, 
foreground layer, and background layer. The mask layer specifies the shape of text and 
high-contrast lines, distinguishing which pixels should be coded using the foreground 
or background coding algorithms. The DjVu foreground layer defines the color de- 
tail within the mask layer, while background texture outside the mask is coded using 
wavelet-coding techniques. In [2] Figure 3, examples of text compressed by the DjVu 
technique are shown for an ‘XVIIIth Century book’ (historical font) and the ‘US First 
Amendment’ (handwritten), along with contemporary newspaper, magazine and scien- 
tific articles. 

S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 158-169, 2004.
© Springer-Verlag Berlin Heidelberg 2004
The independent coding of foreground, mask and background layers allows high
subjective document image quality to be maintained by prioritizing the bit-budget
towards the foreground layer. However, different coding methods are applied to each of
the three layers, effectively coding each image three times. For example, DjVu [2] uses
the JB2 algorithm, a variant of JBIG2 and hence a token-based compressor, for the mask
layer at 300 dpi, and the IW44 wavelet encoder for the background layer at 100 dpi. This
increases the total complexity and coding delay, and constrains potential application
areas. It also introduces redundant data, leading to sub-optimal objective performance in
both lossless and 'lossy' coding modes. Our experimental work [5] also shows that the
mask layer typically occupies at least half of the total file size, limiting scalability.

In this paper, an alternative approach is proposed that uses wavelet coding for both
foreground and background regions, but which shifts bit-planes to produce a high-quality
foreground image. The approach has been tested on archive and historical documents.
The method uses a simple rectangular text block detection process to segment the ROI,
which is then coded with a very small bit overhead compared to the DjVu mask layer.
This approach is similar to the JPEG2000 ROI method, except that the concept of the
system automatically determining the ROI map is lacking in JPEG2000, since in natural
images there is no general criterion that can be specified to identify ROIs. In contrast,
rectangular regions predominate in 'city block' document formats, although, in
general, the ROI method is not restricted to rectangular regions as, just as in JPEG2000,
an arbitrary region can be defined by its coordinates. Compared to multi-layer
methods, the proposed coding method is simple, can include both 'lossy' and near-lossless
representations in a single stream, and supports both Signal-to-Noise (SNR) and
resolution scalability with comparable visual quality at equivalent scale. Where text regions
are small and irregular in some way, a token-based compressor is limited by the
reduction in repeated patterns. However, an example document which is all text is
included in the tests, compensating for this weakness. The major advantage of the ROI
approach is in encoding documents with regions of differing contrast and varying text
types, whereas multi-layer methods are more suited to documents with high-quality text
and embedded high-quality continuous tone illustrations, as occur in some magazines.



2 Text Region Detection 



Archive documents are a typical example where a richer representation of a color
document image is required to convey not only the textual content of the document (which
could be satisfactorily represented in binary or even as encoded characters) but also its
feel and context. Fig. 1 shows an example document image from an index card archive
of Lepidoptera (butterflies and moths) at the UK Natural History Museum, with
superimposed rectangular textual ROIs. The full Lepidoptera archive consists of several
hundred thousand cards, now searchable over the Internet [6].

The format of archive index cards consists of several independent blocks of text,
and each block consists of one or more logically related text fields, as shown in Fig. 1.
Blocks retain a fairly consistent mutual layout over a complete archive, but the layout
of text fields within each block is not strictly fixed. Nor are there any tabular guidelines
defining fixed block boundaries. The X-Y cuts algorithm is, therefore, an appropriate
segmentation algorithm for this class of document image structure.

Fig. 1. Example Lepidoptera archive index card, showing foreground (detail) and background
regions.

Pixel smearing [7], with a threshold sufficient to join adjacent text characters but
not adjacent horizontal words or vertical lines, is applied as a pre-processing non-linear
low-pass filter to each archive card image. The X-Y cuts algorithm [8] is then applied
to the set of archive cards. The algorithm extracts and stores the contents of each index
card into a hierarchical tree structure (the so-called X-Y tree), consisting of text blocks,
lines and words. The level of the cut is dependent on the white space: conventionally, the
space between blocks is greater than between lines and words. The first level cut thus
separates horizontal blocks (which may contain several lines of text) based upon their
vertical spacing, the second level segments lines vertically, and the last level cut
separates words horizontally within each line, using binary connected component bounding
boxes. The extracted contents stored in the X-Y tree thus follow a sequence where
the top level of the tree stores blocks while the bottom level stores individual words.
Alternative techniques are widely reported in the document analysis literature for
segmenting document images with different characteristics from the Fig. 1 example, such
as the other examples considered in our experiments (refer forward to Figs. 4 and 5).
In Fig. 1, the ORIGINAL SPELLING etc. stamp is a redundant artifact of no archival
value. Digital removal of such features before transmission is described in [9], and,
hence, is not considered herein.
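The three-level cut described above can be sketched on a small binary image as follows. This is a simplified illustration based on projection profiles, not the algorithm of [8]: the gap thresholds, the nested-dictionary tree and the toy image are all assumptions, and smearing and connected component analysis are omitted.

```python
def ink_spans(profile, min_gap):
    """Split a projection profile into (start, end) spans of ink that are
    separated by at least min_gap empty positions (end is exclusive)."""
    spans, start, last = [], None, None
    for idx, value in enumerate(profile):
        if not value:
            continue
        if start is None:
            start = idx
        elif idx - last - 1 >= min_gap:
            spans.append((start, last + 1))
            start = idx
        last = idx
    if start is not None:
        spans.append((start, last + 1))
    return spans

def xy_cut(img, box, levels):
    """Build an X-Y tree for box = (top, bottom, left, right).
    levels is a list of (direction, min_gap), one entry per tree level."""
    top, bottom, left, right = box
    node = {"box": box, "children": []}
    if not levels:
        return node
    (direction, min_gap), rest = levels[0], levels[1:]
    if direction == "rows":   # cut between rows of pixels
        profile = [any(img[r][left:right]) for r in range(top, bottom)]
        parts = [(top + a, top + b, left, right) for a, b in ink_spans(profile, min_gap)]
    else:                     # cut between columns of pixels
        profile = [any(img[r][c] for r in range(top, bottom)) for c in range(left, right)]
        parts = [(top, bottom, left + a, left + b) for a, b in ink_spans(profile, min_gap)]
    node["children"] = [xy_cut(img, part, rest) for part in parts]
    return node

def leaf_boxes(node):
    """Collect the boxes stored at the bottom level of the tree."""
    if not node["children"]:
        return [node["box"]]
    return [box for child in node["children"] for box in leaf_boxes(child)]

# Toy image: a two-line block (rows 0-2) and a one-line block (row 6).
img = [
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
]
# Blocks need a 3-row gap, lines a 1-row gap, words a 2-column gap (assumed values).
tree = xy_cut(img, (0, 7, 0, 8), [("rows", 3), ("rows", 1), ("cols", 2)])
words = leaf_boxes(tree)
print(len(tree["children"]), len(words))  # 2 blocks, 5 words
```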

3 Generating the Region of Interest Map 

Because each text ROI is rectangular in shape, the calculation of the ROI map can 
be dramatically reduced, adapting an existing algorithm [10] for that purpose. As in 
a conventional wavelet-decomposition, once the ROI map is generated at the image 
scale, the identification process is recursively repeated at each lower subband, until 
the predefined maximum level depth is reached. The exact mechanism for generating 
the ROI map is related to the wavelet algorithm chosen. The well-known 9/7-tap filter 
serves as an example to explain how to generate the interest map. 
Suppose there is a one-dimensional (1-D) ROI pixel sequence, denoted as $X(n)$.
After the 9/7-tap wavelet transform, the low- and high-frequency wavelet coefficient
sets, denoted respectively $L(\cdot)$ and $H(\cdot)$, that are responsible for the
reconstruction of $X(n)$ are:

if $n$ is even, with $m = n/2$

$L(\cdot) = \{L(m-1), L(m), L(m+1)\}$

$H(\cdot) = \{H(m-2), H(m-1), H(m), H(m+1)\}$

else if $n$ is odd, with $m = (n-1)/2$

$L(\cdot) = \{L(m-1), L(m), L(m+1), L(m+2)\}$

$H(\cdot) = \{H(m-2), H(m-1), H(m), H(m+1), H(m+2)\}$
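The index sets listed above translate directly into a small helper. The function name is an assumption, and clipping of indices at the signal boundaries is left out of this sketch.

```python
def coefficient_sets_1d(n):
    """Indices of the low-pass (L) and high-pass (H) coefficients that
    contribute to the reconstruction of sample X(n) under the 9/7 filter."""
    if n % 2 == 0:
        m = n // 2
        low = list(range(m - 1, m + 2))    # L(m-1) .. L(m+1)
        high = list(range(m - 2, m + 2))   # H(m-2) .. H(m+1)
    else:
        m = (n - 1) // 2
        low = list(range(m - 1, m + 3))    # L(m-1) .. L(m+2)
        high = list(range(m - 2, m + 3))   # H(m-2) .. H(m+2)
    return low, high

print(coefficient_sets_1d(6))  # ([2, 3, 4], [1, 2, 3, 4])
print(coefficient_sets_1d(7))  # ([2, 3, 4, 5], [1, 2, 3, 4, 5])
```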

Because the 2-D wavelet transform is separable, a 2-D mapping can be identified
from 1-D mappings in the horizontal and vertical directions, which correspond to
$L(\cdot)$ and $H(\cdot)$, depending on subband level. For any one pixel in the ROI,
let $x_L(\cdot)$ and $y_L(\cdot)$ represent the sets of Cartesian coordinates from the
coefficient set $L(\cdot)$ (and similarly $x_H(\cdot)$ and $y_H(\cdot)$ for
$H(\cdot)$); then four displaced rectangular regions can be identified as their
direct product coefficient coordinate sets:

$$x_L(\cdot) \otimes y_L(\cdot), \quad x_L(\cdot) \otimes y_H(\cdot), \quad x_H(\cdot) \otimes y_L(\cdot), \quad \text{and} \quad x_H(\cdot) \otimes y_H(\cdot) \qquad (1)$$

Each pixel in an ROI generates four similar wavelet coordinate coefficient sets, and
the four contributing subband regions are the union of the corresponding coordinate
coefficient sets of all pixels.

Rather than applying (1) to each pixel in the ROI, the complete set of displaced
rectangular regions can be generated in a simplified manner by determining the wavelet
coefficients corresponding to just the top left and bottom right corner pixels of each
rectangular ROI [10]. Let $(x_m, y_n)$ and $(x_k, y_l)$ be, respectively, the top left
and bottom right corners of the text region. Apply (1) to each of these two corners,
with $(m_{\min}, n_{\min})$ representing in turn the coordinate of the top-most left
coefficient in each of the four subband regions generated by $(x_m, y_n)$, and
$(k_{\max}, l_{\max})$ similarly corresponding in turn to the bottom-most right
coefficient of each of the four subband regions generated by $(x_k, y_l)$. Then, for
each of the four subband regions, the coordinates correspond to the set

$$\{(x, y) \mid m_{\min} \le x \le k_{\max},\ n_{\min} \le y \le l_{\max}\} \qquad (2)$$

Figure 2 is an example showing a 3 x 3 rectangular coefficient region, with (1)
performed on the two corner coefficients to generate two groups of four sets in each
subband region. The extremes of these sets identify the desired subband regions.
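The corner-based construction can be sketched for a single decomposition level as follows. The function names, the region keys (x-filter letter followed by y-filter letter) and the example corners are assumptions. Clipping to the subband size and the recursion to deeper levels are omitted.

```python
def index_range_1d(n):
    """(min, max) index of the L and H coefficients that contribute to X(n),
    following the even/odd index sets of the 9/7 filter."""
    m = n // 2           # equals (n - 1) // 2 when n is odd
    extra = n % 2        # odd samples reach one coefficient further
    return {"L": (m - 1, m + 1 + extra), "H": (m - 2, m + 1 + extra)}

def roi_subband_regions(top_left, bottom_right):
    """Map a rectangular ROI, given by its two corner pixels, to the inclusive
    rectangle (x_min, x_max, y_min, y_max) in each of the four subbands."""
    (x_m, y_n), (x_k, y_l) = top_left, bottom_right
    x_first, x_last = index_range_1d(x_m), index_range_1d(x_k)
    y_first, y_last = index_range_1d(y_n), index_range_1d(y_l)
    regions = {}
    for bx in "LH":
        for by in "LH":
            # smallest index from the top left corner, largest from the bottom right
            regions[bx + by] = (x_first[bx][0], x_last[bx][1],
                                y_first[by][0], y_last[by][1])
    return regions

regions = roi_subband_regions((8, 6), (21, 11))
print(regions["LL"])  # (3, 12, 2, 7)
print(regions["HH"])  # (2, 12, 1, 7)
```

Only the two corners are transformed, so the cost of building the map is independent of the ROI area.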



4 Region of Interest Technique 

Once the ROI map is generated, all the wavelet coefficients within that region need
to be included in the compressed bitstream. Several bit-plane shifting algorithms are
available [10] for making these coefficients appear as early as possible. Alternatively,
since the ROI algorithm performance is encoder dependent, Park & Park [11] avoid
bit-plane shifting for the SPIHT encoder [12], because mixing ROI and non-ROI coefficients
in the same bit-plane through shifting means that some correlations, which are otherwise
present, cannot be used to reduce the bit-rate. The research in [11] utilizes: 1) a parent
of ROI (PROI) mask; and 2) per-bit-plane multi-lists, which retain information on
tested coefficients that lie outside the ROI, thus preventing information wastage in later
encoding rounds.

Fig. 2. Using two corner pixels to generate subband regions, showing a single-level subband
decomposition.

The work reported here uses a similar method to [11], but with some further en- 
hancements, and adaptations for document coding, where neither an ROI, nor PROI 
map is needed, because each ROI is a rectangular box, and can be represented simply 
as two corner coordinates (Section 3). Instead, a dynamically applied function deter- 
mines (using the stored corner coordinates) whether the current coefficient or set of 
coefficients is or contains an ROI coefficient. This modification considerably reduces 
the memory that would otherwise be needed if ROI and PROI maps were to be used. 
To avoid excessive modification of the SPIHT algorithm, a single list was used rather 
than the multi-lists of [11]. This is achieved by simply adding a coefficient bit-plane 
level indicator, in addition to the coefficient co-ordinates normally held in SPIHT’s sig- 
nificance lists. The indicator records the bit-plane level at which a coefficient could 
become significant during later processing rounds. Neither of these two changes affects 
the performance of the SPIHT algorithm. 



5 ROI Algorithm Description 

Given a wavelet-transformed image, $X$, let
$n_{\max} = \lfloor \log_2(\max(\{c_{i,j}\})) \rfloor$, $c_{i,j} \in X$. In the same
manner as SPIHT, define sets $D(b)$ as all descendants of coefficient $b$, and $L(b)$
as all descendants except the direct descendants. Also, define three ordered auxiliary
lists: (i) LIP to contain insignificant pixels; (ii) LIS to contain insignificant sets;
and (iii) LSP to contain significant pixels/coefficients. Entries in LIS can be of type
A or B (corresponding to D or L type descendants).

Two parameters, which avoid shifting, are user definable; these are $p$ and $r$, as
shown, together with the matching bit-plane thresholds $n_{\max}$ to $n_0$, in Fig. 3.

Fig. 3. After the $p$th bitplane is encoded, the ROI is encoded $r$ bitplanes
earlier than the rest of the image.

The first $p$ bitplanes, indexed 0 to $p - 1$, are encoded using SPIHT over the whole
image. Subsequently, the ROI only is encoded for $r$ bit-planes, indexed $p$ to
$p + r - 1$, and the resulting bits are transmitted. Then, the complete transform
image, apart from the ROI, is encoded for the same $r$ bit-planes, reusing significance
information from the ROI-only encoding rounds. Finally, the remaining bit planes,
indexed $p + r$ to max, are encoded using SPIHT over the whole image. For bitplane
$q$, when $p \le q < p + r$, a modified SPIHT is applied, in a similar manner to Fig. 7
in [11]. However, unlike [11], all decisions about membership of an ROI or PROI are
replaced by a 'judgement' function:



Judge_RoI((i,j), type)
{
    case type:
        coefficient: check whether c_{i,j} belongs to RoI;
                     if so, return true;
        A: check whether any member in D(c_{i,j}) belongs to RoI;
           if so, return true;
        B: check whether any member in L(c_{i,j}) belongs to RoI;
           if so, return true;
}
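One way such a judgement function could be evaluated from stored rectangles alone is sketched below. It assumes the usual SPIHT parent-child rule, in which the children of $c_{i,j}$ are $c_{2i,2j}$, $c_{2i+1,2j}$, $c_{2i,2j+1}$ and $c_{2i+1,2j+1}$, so that the descendants at depth $g$ form one square block. The special treatment of the coarsest band, the rectangle format and all names are simplifying assumptions rather than the authors' implementation.

```python
def intersects(block, rect):
    """True if two inclusive (row0, row1, col0, col1) rectangles overlap."""
    r0, r1, c0, c1 = block
    q0, q1, d0, d1 = rect
    return r0 <= q1 and q0 <= r1 and c0 <= d1 and d0 <= c1

def judge_roi(i, j, kind, roi_rects, size):
    """kind is 'coefficient', 'A' (all descendants D) or 'B' (L = D without
    the direct children). roi_rects are ROI rectangles in the coefficient
    domain; size is the side of the square transform image."""
    first_depth = {"coefficient": 0, "A": 1, "B": 2}[kind]
    last_depth = 0 if kind == "coefficient" else size.bit_length()
    for depth in range(first_depth, last_depth + 1):
        scale = 1 << depth
        if i * scale >= size or j * scale >= size:
            break                      # no descendants at this depth
        # descendants of (i, j) at this depth fill one scale x scale block
        block = (i * scale, (i + 1) * scale - 1, j * scale, (j + 1) * scale - 1)
        if any(intersects(block, rect) for rect in roi_rects):
            return True
    return False

# Hypothetical 16 x 16 transform with one ROI rectangle.
rois = [(8, 11, 4, 7)]
print(judge_roi(9, 5, "coefficient", rois, 16))  # True
print(judge_roi(2, 1, "A", rois, 16))            # True, via its grandchildren
print(judge_roi(4, 2, "B", rois, 16))            # False, no grandchildren exist
```

Each call costs a few rectangle comparisons per tree level, so no per-coefficient ROI or PROI mask has to be held in memory.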
The judgement is made either on a single coefficient in the LIP or LSP, or on coefficient
sets of type A or B in the LIS. Initially (in round 0), all coefficient bit-plane indicators
are set to zero. In each bitplane round, once a coefficient's significance has been tested,
if it is significant then its indicator value will be increased. For example, in bitplane
$q$, suppose $D(i,j)$ is found to be significant and Judge_RoI((i,j), A) is true. Then, for
each of $c_{2i,2j}$, $c_{2i+1,2j}$, $c_{2i,2j+1}$, $c_{2i+1,2j+1}$ and $L(i,j)$, if it
belongs to the ROI map, test its significance, and if significant, increment its
significance indicator to $q + 1$. If a coefficient is not in the ROI map there is no need
to test significance, and the significance indicator remains as $q$. Thus, the algorithm
can precisely record in which bitplane round each coefficient becomes significant. In
fact, significance counting is not confined to the ROI bitplane rounds, but is used for
all bitplane rounds.

For the remaining full bitplanes after the ROI rounds, special treatment should be
accorded to the coefficients or sets of coefficients that indicate possible significance. In
bitplane $q$, only those coefficients having a significance indicator equal to $q$ are tested,
because it is already known that other coefficients are not significant in this cycle.
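The order of encoding passes implied by the parameters $p$ and $r$ can be sketched as a simple schedule. Treating index 0 as the most significant bit-plane is inferred from the description of the first $p$ bitplanes. The function name, the region labels and the clamping of $r$ are assumptions.

```python
def bitplane_schedule(p, r, n_planes):
    """List of (bit-plane index, region) passes in transmission order.
    Index 0 is taken to be the most significant bit-plane."""
    r = min(r, n_planes - p)                                  # cannot run past the last plane
    order = [(q, "whole") for q in range(p)]                  # first p planes, whole image
    order += [(q, "roi") for q in range(p, p + r)]            # r planes, ROI only
    order += [(q, "non-roi") for q in range(p, p + r)]        # same r planes, rest of image
    order += [(q, "whole") for q in range(p + r, n_planes)]   # remaining planes
    return order

schedule = bitplane_schedule(p=2, r=2, n_planes=6)
print(schedule)
```

With `r` equal to `n_planes - p`, every remaining ROI plane precedes the background, which corresponds to the max-shift style setting.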

6 Experimental Results 

DjVu [2] was compared with the proposed ROI method. ROIs were implemented using
IW44, SPIHT [12], and JPEG2000 [11] algorithms. JPEG2000 testing was based on the
source code of JJ2000*, which is recommended on the JPEG official webpage. IW44 is
the wavelet encoder used for DjVu background images. Because the parameter $r$, used
to control the relative importance between foreground and background as explained in
Section 4, can range from 0 to $r_{\max}$, it was impractical to test the impact of all
possible values. Thus, a middle value of 4 and a high value of 8 were respectively chosen.
For SPIHT, the equivalent of the max-shift method [10] was also implemented by setting
$r = r_{\max}$. The main advantage of the max-shift method is that the interest map doesn't
need to be sent to the decoder, but, in this case, the information about ROIs is just a few
coordinates. Therefore, no obvious advantages for the max-shift method are shown by
the test results. The Peak Signal-to-Noise Ratio (PSNR) figures are compared in two ways,
first for the whole image, and secondly for the ROI areas only.
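The two PSNR figures can be computed as in the sketch below, which assumes a single 8-bit component stored as nested lists and ROIs given as inclusive rectangles. The function name, the data layout and the toy images are illustrative assumptions.

```python
import math

def psnr(original, decoded, rects=None, peak=255):
    """PSNR in dB over the whole image, or over the union of inclusive
    (row0, row1, col0, col1) rectangles when rects is given."""
    if rects is None:
        pixels = [(r, c) for r in range(len(original))
                  for c in range(len(original[0]))]
    else:
        pixels = sorted({(r, c) for r0, r1, c0, c1 in rects
                         for r in range(r0, r1 + 1)
                         for c in range(c0, c1 + 1)})
    mse = sum((original[r][c] - decoded[r][c]) ** 2 for r, c in pixels) / len(pixels)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Toy 2 x 2 images with a single differing pixel.
original = [[100, 100], [100, 100]]
decoded = [[100, 110], [100, 100]]
whole = psnr(original, decoded)                      # error averaged over 4 pixels
roi_only = psnr(original, decoded, [(0, 0, 1, 1)])   # only the damaged pixel
print(round(whole, 2), round(roi_only, 2))
```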

The test images are chosen from three possible application areas, by considering 1) 
both Western and Asian images, 2) text only and text-drawing images, and 3) printed 
text and handwritten text. 

The first image used as an example of our results was randomly chosen from the
card archives held at the Natural History Museum in London. As shown in Fig. 1, five
separate ROI areas were identified. In Fig. 1 and subsequently, $r = r_{\max}$, to illustrate
the effect more clearly. Two representative rate points are illustrated in the results, a
low rate (at 0.075 bpp) and a relatively high rate (at 0.28 bpp). Test results are given
in Table 1. From those data, it can be seen that the SPIHT ROI generally gives better
objective performance than DjVu. ROI image quality is improved when $r$ is increased,
trading off against the background image quality. This can be seen by the comparison
between SPIHT_roi(4) and SPIHT_roi(8). The subjective image quality of the SPIHT
ROI is comparable with DjVu, as shown by Fig. 4. Although image b in Fig. 4 contains
some obvious artifacts, it is also more readable than image a, especially if blown up so
that detail is visible. Image c seems more readable than image d, but this is caused by
the loss of many details in the background. For the detailed text region (zoomed 300%
in the left top corner of the test image), the proposed method is subjectively closer to the
original image. Also recall from Section 2 that the stamp would normally be removed
before transmission.

* Java implementation of JPEG 2000, available at http://jpeg2000.epfl.ch/ (last
accessed xi 14 03)



Table 1. PSNR (dB) comparisons of each ROI method for the archive card example (Fig. 4),
the 'Jefferson' letter (Fig. 5), and the Chinese ancient art (Fig. 6).

Image    Bit-rate   Region    DjVu      IW44      IW44_     IW44_     SPIHT     SPIHT_    SPIHT_    SPIHT_    JPEG2000_
                                                  roi(4)    roi(8)              roi(4)    roi(8)    roi(max)  roi(max)
Fig. 4   0.075 bpp  Whole     ...       ...       ...       ...       27.11     27.10     22.47     22.47     25.57
Fig. 4   0.075 bpp  ROI only  ...       ...       ...       ...       ...       22.31     24.56     24.56     23.16
Fig. 4   0.28 bpp   Whole     ...       ...       ...       ...       35.19     35.33     30.16     23.09     28.34
Fig. 4   0.28 bpp   ROI only  ...       ...       ...       ...       ...       31.00     34.40     34.99     33.08
Fig. 5   ...        Whole     ...       ...       ...       ...       34.65     34.65     30.53     30.53     13.12
Fig. 5   ...        ROI only  ...       ...       ...       ...       31.25     31.25     32.20     32.20     33.25
Fig. 5   0.139 bpp  Whole     ...       ...       ...       ...       ...       40.76     40.75     31.52     13.14
Fig. 5   0.139 bpp  ROI only  ...       ...       ...       ...       39.21     39.21     39.34     40.57     41.15
Fig. 6   0.028 bpp  Whole     ...       ...       ...       ...       23.02     23.02     23.00     23.00     23.96
Fig. 6   0.028 bpp  ROI only  ...       ...       ...       ...       ...       19.39     19.40     19.40     19.38
Fig. 6   0.104 bpp  Whole     ...       ...       ...       ...       ...       33.09     32.27     32.27     32.82
Fig. 6   0.104 bpp  ROI only  ...       ...       ...       ...       ...       30.23     30.44     30.44     30.39



Another test image was chosen from the Thomas Jefferson papers at the Library of
Congress; test data in Table 1 show similar results, and the subjective performance is
compared in Fig. 5, confirming the results from the archive card image.

An example of Chinese ancient art, which contains a drawing, was also chosen as a
test image, and a set of visual comparisons is shown in Fig. 6. Zoomed-in detail shows
the advantage gained from using an ROI method for the drawing. In the low-resolution
results, the text decoded by DjVu looks sharper than when using the ROI compression
method. However, the cost of using DjVu is that the quality of the drawing part degrades
more than by the ROI method. This weakness is the result of DjVu's multi-layer method.
During segmentation processing, DjVu performs well within the text region, but
sometimes relegates some parts of the drawing to the background layer. The background
layer is treated as a lower quality layer in DjVu, leading to the quality of the decoded
drawing being reduced. In contrast, the ROI method avoids this problem. By selecting
individual regions, all the useful regions are well protected, no matter whether they are
text or drawing regions. In high-resolution compression, both text and drawing regions
in the ROI method look slightly better than DjVu, in spite of some artifacts around the
edges. The objective results (Table 1) correspond to the above subjective comments
with respect to the high-quality results. The low-resolution results actually show the
advantage of SPIHT over DjVu and not any particular advantage at low resolution for the
ROI method. This may be because SPIHT has suppressed some high-frequency artifacts
in the background of the image, whereas DjVu has included these high-frequency
artifacts in the foreground layer. However, for low-resolution images, SPIHT is as good at
the suppression of high-frequency components as the ROI method. The high-frequency
artifacts are mostly located at the bottom of the image, where a close inspection shows
random fluctuations of image intensity.

Fig. 4. Visual image quality comparison: (a) DjVu at 0.075 bpp, (b) SPIHT_roi_max at
0.075 bpp, (c) DjVu at 0.28 bpp, (d) SPIHT_roi_max at 0.28 bpp, (e) 300% zoom of the top
left corner of (c), (f) 300% zoom of the top left corner of (d).

7 Conclusions 

In this paper, a simple block-based document segmentation algorithm, combined with
ROI methods applied to wavelet coding, is proposed as an alternative to current
multi-layer document coding methods that apply different coding techniques to each layer.
The proposed approach can be applied to a number of wavelet coding algorithms: the
paper compares objective PSNR and subjective visual performance of IW44, SPIHT
and JPEG2000 implementations with a well-known commercial multi-layer coding
algorithm (DjVu). By taking advantage of the simple representation of ROIs using corner
coefficients, further optimisation of SPIHT is possible, resulting in a computationally
efficient scalable coder with favourable performance at very low bit rates. This SPIHT
implementation may be particularly appropriate for use in application areas such as
mobile devices, where display resolution, communications bandwidth and/or
processing capacity are limited, and direct hardware implementation of the coding algorithm
may be required.

Fig. 5. Thomas Jefferson Papers Series 1: letter extract (a) generated by DjVu (b) generated by
the ROI method (c) zoomed version of (a) (d) zoomed version of (b).

Fig. 6. Chinese ancient art: (a) original image (b) DjVu at 0.028 bpp (c) DjVu at 0.104 bpp (d)
SPIHT_roi_max at 0.028 bpp (e) SPIHT_roi_max at 0.104 bpp (f) Zoomed in version of (c) (g)
Zoomed in version of (e).
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Abstract. This paper describes PLANET, a recognition method to be applied on 
Arabic documents with complex structures allowing incremental learning in an 
interactive environment. The classification is driven by artificial neural nets 
each one being specialized in a document model. The first prototype of 
PLANET has been tested on five different phases of newspaper image analysis: 
thread recognition, frame recognition, image text separation, text line recogni- 
tion and line merging into blocks. The learning capability has been tested on 
line merging into blocks. Some promising experimental results are reported. 



1 Introduction 

In the field of document recognition many improvements have been made during the
last decade. However, although three hundred million people around the world use
Arabic daily, methods are still lacking, especially for recognizing complex structured
Arabic documents such as newspapers or magazines. The major difficulty with such
documents is the variability of layout between newspapers and even between different
issues of the same newspaper.

The first methods of layout analysis for documents written in Latin script focused on
document structures [6, 10]. Recent work shows great interest in complex layout
analysis [1, 2, 3] and also recommends the use of learning-based algorithms [5].

Currently known approaches rely on document models, which are either set up by
hand or generated automatically in a previous learning step that needs a lot of
ground-truthed data [8]. The drawback is that such models do not adapt easily to new
situations, a very common condition when dealing with complex document structures,
owing to the variability of layout.

Arabic is known to be a difficult language for character recognition because of its
specific features: the Arabic alphabet is much richer than the Latin one, the form of
a letter changes depending on its position inside the word, and words are written
from right to left.

Therefore we tested the performance of well-known segmentation algorithms and
adapted them to handle complex structured Arabic newspapers [4]. Despite the
encouraging results obtained, we believe that interactive incremental learning is an
important issue in this context. It is one of the main goals of the CIDRE* project,
which aims at building a semi-automatic document recognition system that constructs
its knowledge incrementally through interactions with the user.

In this paper we introduce PLANET, which stands for Physical Layout Analysis of 
Arabic documents using artificial neural NETs. It is an interactive recognition method 
for complex structured Arabic documents based on incremental learning. 

This paper is organized as follows: in Section 2 we present the principles of
PLANET. Section 3 is devoted to the experimental part and the results obtained.
Finally, in Section 4 we conclude our work and give some perspectives for future work.



2 Principles 

Up to now there has been a shortage of tools for both physical and logical layout
extraction. Such tools would help us build ground-truthed repositories useful for the
document community. Nevertheless, building them to behave in a fully automatic way
is not the right approach. First, documents are getting more and more complex
(different layouts, more colors, more typography ...). Second, it is hard to design a
tool that resolves all segmentation problems. It is more appropriate to build a tool
that treats all the phases of the physical layout and allows incremental learning.

For these purposes PLANET has been constructed to allow:

• physical layout extraction (thread extraction, frame extraction, image text separa- 
tion, text line recognition and line merging into blocks) 

• improvement of the recognition ratio through an interactive incremental learning 
phase. 

PLANET is composed of four sequential phases, as illustrated in figure 1.

Fig. 1. Architecture of PLANET.



In the following subsections we review the four phases. 

2.1 The Segmentation Phase 

Our segmentation phase performs the following steps: thread extraction, frame ex- 
traction, image text separation, text line extraction and line merging into blocks. 

* CIDRE stands for Cooperative and Interactive Document Reverse Engineering and is
supported by the Swiss National Fund for Scientific Research, code 2000-059356.99-1.

Threads are an essential part of the layout structure of a newspaper; they serve as
separators between columns of text or between different articles. Frames are a special
kind of paragraph: they are paragraphs surrounded by rectangles. Our algorithm is
modeled as a set of tools that can be used separately. A bottom-up approach based on
connected components is used for image, thread and frame extraction. Connected
components are also used for text line extraction after the use of RLSA (the Run
Length Smoothing Algorithm). The line merging into blocks is done according to rules.
More details about the thread extraction, frame extraction, image text separation,
text line extraction and line merging into blocks are presented in [4]. The output of
our segmentation algorithm is an XML file which describes the segmentation results for
the different components: threads, images, frames, text lines and blocks. A sample of
the XML segmentation output is illustrated in figure 2.
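
As an illustration of the smoothing step applied before text line extraction, the following is a minimal sketch of horizontal run-length smoothing on a binary image. It is not the authors' implementation: the function names, the threshold value and the choice to fill only white runs enclosed by black pixels on both sides are assumptions made for the example.

```python
def rlsa_horizontal(row, threshold):
    """Smooth one binary row (1 = black, 0 = white).

    Every run of white pixels that lies between two black pixels and is
    at most `threshold` pixels long is turned black. `threshold` is a
    hypothetical parameter; the paper does not state the value used.
    """
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel == 1:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def rlsa_image(image, threshold):
    """Apply the horizontal smoothing to every row of a binary image."""
    return [rlsa_horizontal(row, threshold) for row in image]


smoothed = rlsa_horizontal([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], threshold=2)
```

With a threshold of 2, the short gap between the first two black pixels is filled while the longer gap is left open, so characters of one word merge into a single connected component but distant components stay separate.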



<?xml version="1.0" encoding="UTF-8" ?>
<segmentation image="Annahar_15_11_2003.tif">
  <Threads>
    <Thread x="827" y="955" w="3054" h="3" />
  </Threads>
  <Images>
    <Image x="361" y="1054" w="419" h="318" />
  </Images>
  <Texts>
    <Text x="413" y="15" w="8" h="1" />
  </Texts>
  <Frames>
    <Frame x="1010" y="4114" w="627" h="1082" />
  </Frames>
  <Blocks>
    <Block x="1010" y="4114" w="627" h="1082" />
  </Blocks>
</segmentation>



Fig. 2. A sample of the XML segmentation output. 
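
A segmentation file of this shape can be read back with a few lines of code. The sketch below uses Python's standard `xml.etree.ElementTree` module; the element and attribute names follow Fig. 2, while the function name `read_segmentation` and the returned data layout are assumptions chosen for the example. The XML declaration is omitted from the sample string for simplicity.

```python
import xml.etree.ElementTree as ET

# Sample modelled on Fig. 2 (one element per component type).
SAMPLE = """<segmentation image="Annahar_15_11_2003.tif">
  <Threads><Thread x="827" y="955" w="3054" h="3" /></Threads>
  <Images><Image x="361" y="1054" w="419" h="318" /></Images>
  <Texts><Text x="413" y="15" w="8" h="1" /></Texts>
  <Frames><Frame x="1010" y="4114" w="627" h="1082" /></Frames>
  <Blocks><Block x="1010" y="4114" w="627" h="1082" /></Blocks>
</segmentation>"""


def read_segmentation(xml_text):
    """Return the image name and a dict mapping each component tag
    (Thread, Image, Text, Frame, Block) to a list of (x, y, w, h) boxes."""
    root = ET.fromstring(xml_text)
    boxes = {}
    for group in root:          # Threads, Images, Texts, Frames, Blocks
        for element in group:   # the individual components
            box = tuple(int(element.get(key)) for key in ("x", "y", "w", "h"))
            boxes.setdefault(element.tag, []).append(box)
    return root.get("image"), boxes


image_name, boxes = read_segmentation(SAMPLE)
```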



2.2 The Correction Phase 

The correction phase, which follows segmentation, is an interactive process in which
users correct the segmentation errors generated in the previous phase. We consider both
over-segmentation and under-segmentation errors. For this purpose we use xmillum, our
framework for cooperative and interactive analysis of documents, which allows document
recognition results expressed in any XML language to be visualized and edited [7]. User
actions consist of merging or splitting in both directions, horizontally and vertically.
This phase allows the creation of the training set needed by the artificial neural nets
in the next phase. Figure 3 shows a screenshot of the correction with xmillum.

Fig. 3. Correction of the segmentation errors.



2.3 The Training Phase 

The training phase follows the correction phase; at this level the training set is built
for the artificial neural nets. First, we construct a training set for the initial
document model, composed of samples of the correctly segmented blocks and samples of
the user-corrected blocks of each newspaper. Then an artificial neural net is trained
with this training set. After that, the trained artificial neural net is copied three
times, one copy for each newspaper, and each copy is trained further with the
corresponding newspaper training set. Figure 4 illustrates all the steps of the
training phase.




[Diagram labels: Segmentation; Initial document model; Document models]

Fig. 4. The training phase.



2.4 The Test Phase 

The test phase follows the training phase; at this level the three artificial neural nets
are tested and their performance is evaluated. After segmenting each sample of each
newspaper, a test file is constructed. The latter is composed of patterns containing
features. The choice of the features used by the artificial neural nets is described
in the following subsection. Each artificial neural net generates an output file
containing the computed output for each pattern. This output file is visualized with
xmillum. The neural net simulator used is the Java Neural Network Simulator (JavaNNS)
[9]. Figure 5 shows an overview of our artificial neural net under JavaNNS.




Fig. 5. Our artificial neural net under JavaNNS. 



2.4.1 The Choice of Features 

The choice of features is an important step: a good choice generally leads to a good
recognition ratio. Our first goal was to improve the recognition ratio of the line
merging into blocks obtained in the segmentation phase. The output of the line merging
into blocks is a set of blocks containing errors due to either over-segmentation or
under-segmentation. In order to correct such errors, we present this set to the
artificial neural net as pairs of blocks. Each pair is described by the following
features:

• for each of the two blocks: width, height, black pixel density, white pixel density
and connected component ratio,

• the distance between the two blocks.

Our artificial neural net, a Multilayer Perceptron (MLP) [11], is connected in a
feed-forward way and is composed of three layers: input, hidden and output. The input
layer is composed of 11 neurons: the first five correspond to the features of the
first block, the next five to those of the second block, and the last neuron to the
distance between the blocks. After multiple tests, we found that 10 neurons in the
hidden layer gave the best configuration. Finally, the output layer is composed of 1
neuron stating whether to merge or to keep the blocks. A threshold is applied at the
output layer of each artificial neural net in order to separate the two output
classes: merge and keep.
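
The 11-10-1 architecture described above can be sketched as follows. This is only an illustration of the forward pass and of the thresholded decision: the sigmoid activation, the random initial weights, the threshold of 0.5, the scaling of features to the range [0, 1] and all identifiers are assumptions, since the paper relies on JavaNNS and does not give these details.

```python
import math
import random

# Per-block features named in the text; the key names are hypothetical.
FEATURES = ("width", "height", "black_density", "white_density", "cc_ratio")


def pair_to_input(block_a, block_b, distance):
    """Build the 11-value input: 5 features per block plus the distance."""
    return ([block_a[f] for f in FEATURES]
            + [block_b[f] for f in FEATURES]
            + [distance])


def sigmoid(x):
    """Numerically stable logistic function (assumed activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


class MergeMLP:
    """Feed-forward net with 11 inputs, 10 hidden neurons and 1 output."""

    def __init__(self, n_in=11, n_hidden=10, seed=0):
        rng = random.Random(seed)
        # The last weight of every neuron acts as its bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
                  for w in self.w_hidden]
        return sigmoid(self.w_out[-1]
                       + sum(wi * hi for wi, hi in zip(self.w_out, hidden)))

    def decide(self, x, threshold=0.5):
        """Map the single output to one of the two classes."""
        return "merge" if self.forward(x) >= threshold else "keep"


block = {"width": 0.4, "height": 0.1, "black_density": 0.3,
         "white_density": 0.7, "cc_ratio": 0.2}
pattern = pair_to_input(block, block, distance=0.05)
decision = MergeMLP().decide(pattern)
```

The weights here are untrained, so the decision carries no meaning; the sketch only shows how a pair of blocks becomes one pattern and how the output neuron is thresholded.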



3 Tests and Results 

Our first PLANET prototype has been implemented in Java with the integration of the
neural net simulator JavaNNS. The segmentation and the feature extraction took less
than thirty seconds, whereas the learning phase inside JavaNNS took less than two
minutes on a Pentium 4.

It has been tested on a set of pages from three Arabic newspapers: Annahar, AL
Hayat and AL Quds. Figure 6 illustrates a sample page of the AL Quds newspaper.







Fig. 6. Page sample of AL Quds newspaper. 

The evaluation of PLANET has been performed on 90 pages of the Annahar, AL Hayat
and AL Quds newspapers. Table 1 shows the recognition results for the AL Quds
newspaper before the training. The corresponding results for the Annahar and AL Hayat
newspapers are reported in [4].



Table 1. AL Quds recognition results.

%          Thread    Frame     Image     Text line   Line merging into blocks
AL Quds    95.551    95.158    94.092    92.869      91.765



The lower recognition rate for text lines is essentially due to a recurrent error: an
ambiguity arises when the diacritics of one line and those of the next line are close
to each other or merged. Since the recognition rate of line merging into blocks depends
on the results of text line recognition, the errors of text line recognition are
propagated to the next processing steps.

3.1 PLANET Training Phase 

The training set for the initial document model is composed of samples of the well
segmented blocks and samples of the user-corrected blocks of the three newspapers (5
pages from each one). After training the initial document model and duplicating it,
once for each newspaper, we train each artificial neural net again with the
corresponding newspaper training set. The number of pages used in the training phase
is 25 for Annahar, 10 for AL Hayat and 10 for AL Quds. Figure 7 shows the PLANET
recognition rate vs. the number of manipulations done by the user for the three
newspapers.

Fig. 7. PLANET recognition rate vs. the number of manipulations done by the user for
the Annahar, AL Hayat and AL Quds newspapers.



The curves in Fig. 7 are obtained as follows: for each sample of one newspaper we test
the recognition rate and correct the misclassified patterns. Then we add these
corrected patterns to the training set and train the artificial neural net again.
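
The test, correct and retrain cycle described above can be sketched as a short loop. To keep the example self-contained, a 1-nearest-neighbour classifier is swapped in for the artificial neural net, so "retraining" reduces to storing the corrected patterns; the function names and the toy data are hypothetical.

```python
def nearest_label(training_set, x):
    """Stand-in classifier (1-NN, not the paper's MLP): return the label
    of the stored pattern closest to x in squared Euclidean distance."""
    closest = min(training_set,
                  key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return closest[1]


def incremental_round(training_set, sample):
    """Test one sample, then add its corrected errors to the training set.

    `sample` is a list of (features, true_label) pairs; the true label
    plays the role of the user's correction. Returns the recognition
    rate (in %) measured before the corrections were added.
    """
    errors = [(x, y) for x, y in sample if nearest_label(training_set, x) != y]
    rate = 100.0 * (len(sample) - len(errors)) / len(sample)
    training_set.extend(errors)  # for 1-NN, retraining is just storing
    return rate


training_set = [((0.0,), "keep"), ((1.0,), "merge")]
sample = [((0.1,), "keep"), ((0.9,), "merge"), ((0.6,), "keep")]
first_rate = incremental_round(training_set, sample)
second_rate = incremental_round(training_set, sample)
```

In this toy run the pattern misclassified in the first round is added to the training set, so the same sample is fully recognized in the second round.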

We observe a large variation of the layout for the Annahar newspaper, which
corresponds to the variation of its recognition rate. This variation is smaller for
the AL Hayat newspaper, whereas for AL Quds the recognition rate is stable at the
end. Table 2 shows the PLANET average recognition rate for the three newspapers in
the training phase.



Table 2. PLANET average recognition rate obtained in the training phase.

%                          Annahar   AL Hayat   AL Quds
Average recognition rate   97.549    99.108     98.682

3.2 PLANET Test Phase

In the test phase we measure the performance of PLANET. A test set for each newspaper
is constructed and tested within PLANET. First of all we tested the initial document
model with the training and test sets. Table 3 shows the PLANET average recognition
rate obtained for the initial document model with the training and test sets.



Table 3. PLANET average recognition rate for the initial document model.

%                          Annahar   AL Hayat   AL Quds
Average recognition rate   96.667    96.857     96.572



Then we tested for each document model the recognition ratio with its correspond- 
ing test set. Table 4 shows PLANET average recognition rate with the test set. 



Table 4. PLANET average recognition rate in the test phase.

%                          Annahar   AL Hayat   AL Quds
Average recognition rate   97.963    98.343     99.051



Comparing the results in Table 3 and Table 4, we notice an improvement of the
recognition rate for the line merging into blocks with the introduction of learning
in each document model. Table 5 gives the recognition rate obtained before and after
the learning phase and the ratio of improvement.



Table 5. PLANET results of recognition for Annahar, AL Hayat and AL Quds.

%                            Annahar   AL Hayat   AL Quds
Segmentation only            95.217    91.438     91.765
Segmentation with learning   97.963    98.343     99.051
Ratio of improvement         2.348     5.167      8.677



We have also tested the cross recognition ratio: for example, we test the recognition
ratio of Annahar with the training sets of both AL Hayat and AL Quds, and vice versa.
Table 6 gives all the combinations of the cross recognition ratio.



Table 6. PLANET cross recognition ratio.

Training set / Test set   Annahar   AL Hayat   AL Quds
Annahar                   97.963    98.163     97.217
AL Hayat                  95.630    98.343     96.336
AL Quds                   96.632    98.081     99.051



The analysis shows that the Annahar, AL Hayat and AL Quds models are specialized,
since the recognition ratio decreases when the training set and the test set are not
from the same newspaper. In fact each model has learned its own specific features.
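
This reading of Table 6 can be checked with a short script. The sketch assumes that the rows of the table are the training sets and the columns are the test sets, which the table header does not state unambiguously; the function name is hypothetical. For each test set it reports which model gives the highest recognition ratio.

```python
NAMES = ["Annahar", "AL Hayat", "AL Quds"]

# Values of Table 6 in %. Assumed orientation: rows = training set,
# columns = test set.
CROSS = [
    [97.963, 98.163, 97.217],  # trained on Annahar
    [95.630, 98.343, 96.336],  # trained on AL Hayat
    [96.632, 98.081, 99.051],  # trained on AL Quds
]


def best_model_per_test_set(matrix, names):
    """For every test set (column), return the training set (row) whose
    model reaches the highest recognition ratio."""
    best = {}
    for col, test_name in enumerate(names):
        row = max(range(len(names)), key=lambda r: matrix[r][col])
        best[test_name] = names[row]
    return best


best = best_model_per_test_set(CROSS, NAMES)
```

Under this orientation, every test set is best recognized by the model trained on the same newspaper, which is the specialization noted in the text.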
4 Conclusion

In this paper we describe a document layout analysis method featuring interactive 
incremental learning named PLANET. Encouraging experimental results are reported 
concerning its application for recognizing complex structured Arabic documents. 

The classification is driven by artificial neural nets, each specialized in a document model.
The first prototype of PLANET has been tested on five different phases of newspaper 
image analysis: thread recognition, frame recognition, image text separation, text line 
recognition and line merging into blocks. The learning capability has been tested on 
line merging into blocks. 

We believe that PLANET can be used successfully as a tool to build ground-truthed 
repositories: users can, through some mouse clicks, easily correct segmentation errors 
and produce ground-truthed datasets. 

Our future work on PLANET will focus on the improvement of the model. We would
like to test PLANET with other types of documents, for example those written in Latin
script, and possibly with other applications.


Abstract. In this paper we present an integrated approach for semantic structure
extraction in document images. Document images are initially processed to extract
both their layout and logical structures on the basis of geometrical and spatial
information. Then, the textual content of logical components is employed for
automatic semantic labeling of layout structures. To support the whole process,
different machine learning techniques are applied. Experimental results on a set
of biomedical multi-page documents are discussed and future directions are drawn.



1 Introduction 

The increasingly large number of paper documents to be processed daily in office
environments requires new document management systems with the ability to
automatically catalog and organize these documents on the basis of their contents.
Personal document processing systems that can provide functional capabilities like
classifying, storing, retrieving, and reproducing documents, as well as extracting,
browsing, retrieving and synthesizing information from a variety of documents are in
continual demand [7]. In order to extend these capabilities to paper documents it is
necessary to convert them into a suitable electronic format. This can be done by
applying knowledge technologies, such as machine learning tools or knowledge
representation languages. They are so relevant that some distinguished researchers
have claimed that document image analysis and understanding belong to a branch of
artificial intelligence [16], despite the fact that most of the contributions fall
within the area of pattern recognition [12].

In this paper we present a Document Image Analysis (DIA) framework with a
knowledge-based architecture that supports all the processing steps required for the
semantic structure extraction from document images. More precisely, this results from
a tight integration of the system WISDOM++, which performs document understanding
on the basis of geometrical information, with the content-based classification
capabilities provided by the system WebClassII [4]. WebClassII is a client-server
application that performs the automated classification of Web pages on the basis of
their textual content. WebClassII preprocessing and classification modules have been
integrated in the proposed framework and applied to the textual content of logical
components of interest extracted by WISDOM++.

In WISDOM++, document understanding consists in the mapping of the geometrical
structure, extracted during the layout analysis step, into the logical structure, which
associates the content of the layout components with a hierarchy of logical components,
such as the title and authors of a scientific article. The mapping is based on the
assumption that in many documents the logical and the layout structures are so
strongly related that a user is able to “understand” the document without reading the
content itself, considering only geometrical criteria. For instance, the title of an
article is usually located at the top of the first page of a document and is written
with the largest character set used in the document.

Nevertheless, current results in document management research highlight that
improving the performance of information retrieval systems depends strongly on the
use of semantic information about the textual content of documents [14]. In our
proposal we investigate an upgrade of WISDOM++ that supports document understanding
with content-based classification functionalities.

The paper is organized as follows. In the next section the DIA system WISDOM++
is briefly described and some related work on document understanding is discussed.
In Section 3 the layout analysis method is introduced and the document understanding
strategy is extensively reported. Section 4 describes the semantic structure extraction
approach focusing on the preprocessing and classification methods. Experimental
results are shown in Section 5 and future work is presented in Section 6.

2 Background and Related Work 

WISDOM++* is a document analysis system that can transform textual black and
white paper documents into XML format [2]. This is a complex process involving
several steps. First, the image is converted to black and white and is segmented into
basic layout components (non-overlapping rectangular blocks enclosing content
portions) by means of an efficient variant of the Run Length Smoothing Algorithm.

These layout components are classified according to the type of their content: text,
horizontal line, vertical line, picture and graphics. This classification is performed by
means of a decision tree automatically built from a set of training examples of the five
classes. Then, layout analysis is performed in order to detect structures among blocks.
The result is a tree-like structure which is a more abstract representation of the
document layout. This representation associates the content of a document with a
hierarchy of layout components, such as blocks, lines, and paragraphs. Considering the
extracted layout, the document image classification is performed. This aims at
identifying the membership class (or type) of a document (e.g. business letter,
newspaper article, and so on) by means of some first-order rules which can be
automatically learned from a set of training examples [10]. In document image
understanding, layout components are associated with logical components. This
association can theoretically affect layout components at any level in the layout
hierarchy. However, in WISDOM++ only the most abstract components (called frame2)
are associated with components of the logical hierarchy. Moreover, only layout
information is used in document understanding. This approach differs from that
proposed by other authors [9], who additionally make use of textual information, font
information and universal attributes given by the OCR. This difference is due to a
different conviction on when an OCR should be applied. We believe that only some
layout components of interest for the application should be subject to OCR, hence
document understanding should precede text reading and cannot be based on textual
features. Two assumptions are made: documents belonging to the same class have a set
of relevant and invariant layout characteristics; logical components can be identified
by using layout information only. Document image understanding also uses first-order
rules [10]. Once the logical and layout structures have been mapped, OCR can be
applied only to those textual components of interest for the application domain, and
their content can be stored for future retrieval purposes. The result of the document
analysis is an XML document that makes the document image retrievable.

* http://www.di.uniba.it/~malerba/wisdom++/

Similarly to [1], we propose an integrated approach for document understanding
that is based on both geometric and content features and makes extensive use of
knowledge. In particular, the authors of [1] consider document understanding as the
extraction of logical relations on layout objects, where logical relations express
reading order between layout components, and make extensive use of spatial reasoning
and natural language processing techniques to support the extraction of reading order
relations. In contrast to this approach, we consider document understanding as logical
structure detection combined with a mapping into a set of predefined semantic classes.

As clearly explained in [6], the research trend in document understanding is towards
document management frameworks that can handle different typologies of documents
and employ hybrid strategies for knowledge capture in order to handle different
dimensions of information (e.g. textual, layout, format, tabular, etc.). In particular,
it seems that document management systems will not answer real-world needs as long
as they continue to tailor their solutions to the individual sub-problems into which
the whole process of document understanding can be divided. Our proposal fits this
scenario and tries to answer some of the mentioned issues.

3 Document Layout Analysis and Understanding in WISDOM++

Before entering into the details of semantic structure extraction, it is important to
clarify the document analysis steps performed by WISDOM++, and to briefly explain how
machine learning techniques are employed to guarantee a high degree of adaptivity.

Considering the output of the segmentation and block classification, this
representation is still too detailed for learning document classification and
understanding rules. Therefore, layout analysis is performed to detect structures among
blocks. The result is a hierarchy of abstract representations of the document image.
The leaves of the layout tree are the blocks, while the root represents the whole
document. A page may group together several layout components, called frames, which are
rectangular areas of interest in the document page image. The layout analysis is done
in two steps: a global analysis of the document image in order to determine possible




182 Margherita Berardi, Michele Lapi, and Donato Malerba 



areas containing paragraphs, sections, columns, figures and tables, and a local analysis
to group together blocks which possibly fall within the same area. Perceptual criteria
considered in this last step are: proximity (e.g. adjacent components belonging to the
same column/area are equally spaced), continuity (e.g. overlapping components) and
similarity (e.g. components of the same type, with an almost equal height). Pairs of
layout components that satisfy some of these criteria may be grouped together. The
layout structure extracted by WISDOM++ is a hierarchy with six levels: basic blocks,
lines, set of lines, frame1, frame2, and pages. The last level is specified by the user,
while the first five are automatically extracted. If the user is not satisfied with the
result of the layout analysis, he/she can act directly on the results of the
segmentation process by deleting some blocks, or can modify the result of the global
analysis by performing some splitting or grouping operations, which are later used to
learn rules for the automated correction of the layout analysis [3].

Some logical components of the document, such as title and authors of a paper, can
be identified after having detected the layout structure. They can be arranged in a
hierarchical structure, which is called the logical structure, resulting from a division
of the content of a document into increasingly smaller parts. The leaves of the logical
structure are the basic logical components, such as authors of a paper. The heading of
an article, encompassing the title and the author, is an example of a composite logical
component. The root of the logical structure is the document class. The discovery of
the logical structure of a document can be cast as the problem of associating some
layout components with a corresponding logical component. In WISDOM++ it consists in
the association of a page with a document class (document classification) and
of frame2 layout components with basic logical components (document understanding).
Classification is performed by matching the layout structure of the first page
against models of classes of documents that are able to capture the invariant properties
of the images/layout structures of documents belonging to the same class. Document
understanding of all pages is performed by matching the layout structure of each
page against models of logical components. An example of a model for the logical
components running_head and paragraph in the case of scientific papers might be:

logic_type(Y)=paragraph ← on_top(X,Y)=true,
    logic_type(X)=running_head, type(Y)=text

This rule means that a textual layout component below a running head is a paragraph
of the paper. Further details on document image understanding are reported in [11].
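
To make the reading of such a rule concrete, the sketch below applies it by hand to a small set of layout components. WISDOM++ learns and matches first-order rules; this example only hard-codes the single rule quoted above, and the data structures and the function name `label_paragraphs` are assumptions made for illustration.

```python
def label_paragraphs(components, on_top):
    """Apply the rule: if X is on top of Y, X is a running head and Y is a
    text component, then Y is labelled as a paragraph.

    components: dict id -> {"type": str, "logic_type": str or None}
    on_top:     iterable of (x, y) pairs meaning x lies directly above y
    """
    for x, y in on_top:
        if (components[x]["logic_type"] == "running_head"
                and components[y]["type"] == "text"):
            components[y]["logic_type"] = "paragraph"
    return components


page = {
    "c1": {"type": "text", "logic_type": "running_head"},
    "c2": {"type": "text", "logic_type": None},
    "c3": {"type": "picture", "logic_type": None},
}
labelled = label_paragraphs(page, [("c1", "c2"), ("c2", "c3")])
```

Only the text component directly below the running head receives the paragraph label; the picture below it is left unlabelled because neither condition on its neighbour holds.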

4 Semantic Structure Extraction 

Due to the limits of layout-based indexing, WISDOM++ has been integrated in a
framework which supports the indexing phase by using relevant terms automatically
extracted from logical components of interest. The goal is to select from the OCR-ed
text of a logical component those terms that allow the system to assign the logical
component to one semantic class from a set of predefined domain-dependent classes.

As shown in Fig. 1, in this framework the output of WISDOM++, that is the logical
labeling and OCR-ed text of logical components, is used as input for WebClassII. In
particular, the feature extractor module selects relevant terms from training textual
components, from which the classifier is learned. It is noteworthy that the output of
the automatic logical labeling performed by WISDOM++ may be subject to users'
correction before being used as WebClassII input. Users' corrections are useful both to
generate new training examples for logical labeling learning and to preserve
WebClassII classifier learning. The output of the WebClassII modules is the tagging of
those semantic components that cannot be classified through geometrical criteria alone
(e.g. the section about the Methods of a scientific paper) but only with the support
of a semantic-based classification step.




Fig. 1. DIA Framework Architecture. 



WebClassII is a client-server application that performs the automated classification
of Web pages on the basis of their textual content. The automated classification of
documents, in general, requires solving two problems: the definition of a
representation language and the construction of a classifier tailored to that
representation language. WebClassII adopts a feature vector representation of
documents, where each single feature corresponds to a distinct term extracted from the
training documents. Features are determined by means of a complex preprocessing phase,
which includes both a term extraction and a term selection process. As regards the
classifier construction problem, WebClassII has two alternative ways of assigning a
Web page to a class, a centroid-based method and a naive Bayesian method, and both of
them require a training phase by means of which the classifier is learned.

In our specific context, we use the preprocessing and the classification modules as
common text document processing modules without taking into account WebClassII's
Web document management capabilities.

4.1 The Feature Extractor Module 

Initially, all training documents are tokenized, and the set of tokens (words) is filtered 
in order to remove punctuation marks, numbers and tokens of less than three charac- 
ters. The basic idea is to select relevant tokens to be used in the bag-of-words repre- 
sentation. These tokens will be called features. 

The text preprocessing procedure implemented in WebClassII is:

1. Removal of words with high frequency, or stopwords, such as articles, prepositions,
and conjunctions.

2. Removal of suffixes, such as those used in plurals (e.g., -s, -es and -ies), gerunds
(-ing), the simple past (-ed), and so on.

3. Determination of equivalent stems (stemming), such as “analy” in the words
“analysis”, “analyses”, “analyze”, “analyzing” and “analyzer”.

For the first step, the stopwords used by WebClassII have been taken from Glimpse
(glimpse.cs.arizona.edu), a tool used to index files by means of words, while for the
last two steps, the algorithm proposed by Porter [13] has been implemented.
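
The preprocessing steps described in this section can be sketched as follows. This is
only an illustration under stated assumptions: the stopword list is a tiny hypothetical
sample (not the Glimpse list), and the suffix rules are a crude stand-in for Porter's
algorithm, not an implementation of it. The function names are ours.

```python
import re

# Hypothetical miniature stopword list; the paper takes its list from Glimpse.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was", "from"}

# Crude suffix rules standing in for Porter's algorithm (NOT the real stemmer).
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def tokenize(text):
    """Return lowercase alphabetic tokens; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop tokens shorter than three characters, drop stopwords, strip suffixes."""
    tokens = [t for t in tokenize(text) if len(t) >= 3]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [strip_suffix(t) for t in tokens]
```

A real system would replace `strip_suffix` with a full Porter stemmer, which also
handles cases such as mapping “analysis” and “analyze” to a common stem.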

Many approaches have been proposed in the literature on information retrieval for
the identification of relevant words to be used as index terms of documents [15]. Most
of them simply score words according to some measure and select the top-ranked ones.
However, techniques proposed for information retrieval purposes are not always
appropriate for the task of document classification, also known as text categorization.
Indeed, we are not interested in words characterizing each single document; rather, we
look for words that distinguish a class of documents from other classes. Generally
speaking, the set of words required for classification purposes is much smaller than the
set of words required for indexing purposes.

The feature selection algorithm implemented in WebClassII is based on a variant of
TF-IDF. Given a training document d of the i-th class, for each token t the frequency
TF(i,d,t) of the token in the document is computed. Then, for each class i and token t,
the following statistics are computed:

- MaxTF(i,t), the maximum value of TF(i,d,t) over all training documents d of class i;

- PF(i,t), the page frequency, that is, the percentage of documents of class i in which
the token t occurs.
The union of the sets of tokens extracted from the documents of one class defines an
“empirical” class dictionary used by documents on the topic specified by the class. By
sorting the dictionary with respect to MaxTF(i,t), words occurring frequently in only
one document might be favored. By sorting each class dictionary according to the
product MaxTF(i,t)·PF(i,t)², briefly denoted as the MaxTF-PF² (Max Term Frequency -
Square Page Frequency) measure, the effect of this phenomenon is kept under control.
Moreover, common words used in documents of a given class will appear in the first
entries of the corresponding class dictionary. Some of these words are actually
specific to that class, while others are simply common English words (e.g., “information”,
“unique”, “suggestion”, “time” and “people”) and should be considered as
quasi-stopwords. In order to move quasi-stopwords down in the sorted dictionary, the
MaxTF-PF² of each term is multiplied by a factor ICF = 1/CF(t), where CF(t) (category
frequency) is the number of class dictionaries in which the word t occurs. In this way,
the sorted dictionary will have the most representative words of each class in the first
entries, so that it is sufficient to choose the first N words per dictionary in order
to define the set of attributes. Once the class dictionaries are determined, a unique set
of features is selected to represent documents of all classes, and training
documents can then be represented as feature vectors of term frequencies.
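
The MaxTF-PF²-ICF ranking described above can be sketched as follows. This is a minimal
illustration, not the WebClassII code: the function names and the input layout (a mapping
from class name to tokenized documents) are assumptions, and PF is expressed as a
fraction rather than a percentage, which does not change the ranking.

```python
from collections import Counter, defaultdict


def class_dictionaries(training):
    """training: {class_name: [token list per document]}.
    Returns {class_name: {token: MaxTF * PF**2 * ICF}}."""
    max_tf = defaultdict(dict)  # MaxTF(i, t)
    pf = defaultdict(dict)      # PF(i, t), as a fraction of the documents of class i
    for cls, docs in training.items():
        doc_counts = [Counter(doc) for doc in docs]
        for t in set().union(*doc_counts):
            max_tf[cls][t] = max(c[t] for c in doc_counts)
            pf[cls][t] = sum(1 for c in doc_counts if t in c) / len(docs)
    # CF(t): number of class dictionaries in which token t occurs.
    cf = Counter(t for cls in max_tf for t in max_tf[cls])
    return {
        cls: {t: max_tf[cls][t] * pf[cls][t] ** 2 / cf[t] for t in max_tf[cls]}
        for cls in max_tf
    }


def select_features(training, n):
    """Union of the n best-ranked tokens of each class dictionary."""
    features = set()
    for table in class_dictionaries(training).values():
        ranked = sorted(table, key=lambda t: (-table[t], t))  # ties broken alphabetically
        features.update(ranked[:n])
    return sorted(features)
```

In this sketch a token that occurs in every class dictionary is divided by the number of
classes, which is how quasi-stopwords are pushed down the ranking.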



4.2 The Classification Method 



Currently, WebClassII has two alternative ways of assigning a Web page to a category:



1. By computing the similarity between the document and the centroid of that cate- 
gory. 

2. By estimating the Bayesian posterior probability for that category (naive Bayes). 



Therefore, a training phase is necessary either to compute the centroids of the cate- 
gories or to estimate the posterior probability distributions. In the following, only the 
naive Bayes method is illustrated because it outperforms the centroid method and it 
has been used for experiments. 

Let d be a document temporarily assigned to category c. We intend to classify d
into one of the subcategories of c. According to Bayesian theory, the optimal
classification of d assigns d to the category c_i ∈ SubCategories(c) maximizing the
posterior probability P_c(c_i | d). Under the assumption that each word in d occurs
independently of other words, as well as independently of the text length, it is
possible to estimate the posterior probability as follows:

$$P_c(c_i \mid d) = \frac{P_c(c_i) \prod_{w \in FeatSet_c} P_c(w \mid c_i)^{TF(w,d)}}{\sum_{c' \in SubCategories(c)} P_c(c') \prod_{w \in FeatSet_c} P_c(w \mid c')^{TF(w,d)}} \qquad (1)$$



where the prior probability P_c(c_i) is estimated as follows:

$$P_c(c_i) = \frac{|Training(c_i)|}{\sum_{c' \in SubCategories(c)} |Training(c')|} \qquad (2)$$



and the likelihood P_c(w | c_i) is estimated according to Laplace’s law of succession:

$$P_c(w \mid c_i) = \frac{1 + PF(w, c_i)}{|FeatSet_c| + \sum_{w' \in FeatSet_c} PF(w', c_i)} \qquad (3)$$



In the above formulas, TF(w,d) and PF(w,c) denote the absolute frequency of w in
d and the absolute frequency of w in documents of category c, respectively. The
likelihood P_c(w | c_i) could alternatively be estimated according to the relative
frequency, that is:

$$P_c(w \mid c_i) = \frac{PF(w, c_i)}{\sum_{w' \in FeatSet_c} PF(w', c_i)} \qquad (4)$$



The above formulation of the naive Bayes classifier assigns a document d to the 
most probable or the most similar class, independently of the absolute value of the 
posterior probability. By assuming that documents to be rejected have a low posterior 
probability for all categories, the problem can be reformulated in a different way, 
namely, how to define a threshold for the value taken by a naive classifier. Details on 
the thresholding algorithm are reported in [5]. 
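
The posterior estimate with the Laplace-smoothed likelihood can be sketched as follows.
This is an illustration only: the function name and the dictionary-based inputs are
assumptions, and the computation is carried out in log space to avoid numerical
underflow, a standard implementation choice that the paper does not discuss. The
thresholding step of [5] is not modelled.

```python
import math


def posteriors(tf_doc, pf, n_training, features):
    """tf_doc: {word: TF(w, d)}; pf: {category: {word: PF(w, c)}};
    n_training: {category: number of training documents}; features: FeatSet_c.
    Returns {category: posterior probability} over the sibling categories."""
    total_docs = sum(n_training.values())
    log_scores = {}
    for c in pf:
        # Denominator of the Laplace estimate: |FeatSet| + sum of PF over the features.
        denom = len(features) + sum(pf[c].get(w, 0) for w in features)
        score = math.log(n_training[c] / total_docs)  # prior
        for w in features:
            likelihood = (1 + pf[c].get(w, 0)) / denom
            score += tf_doc.get(w, 0) * math.log(likelihood)  # exponent TF(w, d)
        log_scores[c] = score
    # Normalise over the sibling categories, shifting by the maximum for stability.
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {c: math.exp(s - m) / z for c, s in log_scores.items()}
```

Because of the smoothing, a word never seen in a category contributes a small but
non-zero likelihood, so a single unseen word cannot force the posterior to zero.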

The classification process of a new document is performed by searching the hierar- 
chy of categories. The system starts from the root and selects the nodes to be expanded 
such that the score returned by the classifier is higher than a threshold determined by 
the system. At the end of the process, all explored categories (either internal or leaf) 
are considered for the final selection. The winner is the explored category with the 
highest score. If the document is assigned to the root, then it is considered rejected. 
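
The search through the category hierarchy can be sketched as follows. This is one
possible reading of the description, under assumptions: `score` and `threshold` are
hypothetical stand-ins for the classifier output and the system-determined thresholds,
scores at different levels are assumed to be comparable, and a document for which no
category is explored is treated as remaining at the root (rejected).

```python
def classify(root, children, score, threshold):
    """children: {category: [subcategories]}; score(category) -> float;
    threshold: {category: float}.
    Returns the winning category, or None when the document is rejected."""
    explored = {}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            s = score(child)
            if s > threshold[child]:   # expand only categories above their threshold
                explored[child] = s
                frontier.append(child)
    if not explored:
        return None                    # document stays at the root: rejected
    return max(explored, key=explored.get)
```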



5 Experimental Results 

A user/trainer of WISDOM++ is asked to label some layout components of a set of
training documents according to their logical meaning. Those layout components with
no clear logical meaning are not labeled. Therefore, each document generates as many
training examples as the number of layout components. Classes of training examples
correspond to the distinct logical components to be recognized in a document. The
unlabelled layout components play the role of counterexamples for all the classes to be
learned. In particular, we have considered a set of nineteen full-text multi-page
documents, which are scientific papers on biomedical topics. This kind of text corpus is
particularly suited to layout-based document understanding and, on the other hand,
highlights the need to support it with content information. Indeed, these papers are
characterized by a regular section structure from both the geometrical viewpoint and
the content distribution viewpoint. In particular, the text of each article is distributed
over five different and recurrent types of sections: abstract, introduction, methods,
results and discussion. We processed 103 document images in all, each of which has a
variable number of layout components. In particular, we use 51 document images for
training the learning system integrated in WISDOM++, that is ATRE [10], on logical
labelling and 24 for testing the induced set of rules. Layout components can be
associated with at most one of the following logical components: title, authors, abstract,
section and section_title. Table 1 shows the distribution of logical components over the
processed documents. In particular, the total number of logical components is 790 (583
of which are undefined), about 49 logical descriptors for each document page.



Table 1. Training set. Distribution of pages and examples per document. 



Name of the           No. of  Title   Authors  Abstract  Section  Section-title  Total no. of
multi-page document   pages   labels  labels   labels    labels   labels         examples

Article_1             6       1       1        1         21       3              94
Article_2             4       1       0        1         10       3              82
Article_3             6       1       1        1         18       6              81
Article_4             3       1       1        1         8        4              46
Article_5             3       1       1        1         5        4              35
Article_6             7       1       1        1         14       5              69
Article_7             7       1       0        1         16       4              88
Article_8             4       1       1        1         11       3              55
Article_9             4       1       1        1         9        3              156
Article_10            2       1       1        1         4        3              23
Article_11            5       1       1        1         17       5              61

Total (training)      51      11      9        11        133      43             790



In order to test the predictive accuracy of the learned theory, we considered 3
articles whose distribution is reported in Table 2. WISDOM++ segmented the 24 pages
into 482 layout components. WISDOM++ is then asked to use the learned theory to label
layout components as generic sections or section_title, as well as title, authors and
abstract sections.

Table 2. Testing set. Distribution of pages and examples per document. 



Name of the           No. of  Title   Authors  Abstract  Section  Section-title  Total no. of
multi-page document   pages   labels  labels   labels    labels   labels         examples

Article_12            10      1       1        1         41       4              208
Article_13            5       1       1        1         15       4              108
Article_14            9       1       1        1         33       3              200

Total (testing)       24      3       3        3         89       11             516



Table 3 shows the commission and omission errors made on the set of testing
documents. A commission error occurs when a wrong labelling of a logical component
is “recommended” by a rule, while an omission error occurs when a “correct”
labelling is missed.

Table 3. Commission and omission errors performed by learned rules.

Rule for                       No. omission errors   No. commission errors

logic_type(X)=title            0                     5
logic_type(X)=authors          0                     2
logic_type(X)=abstract         0                     2
logic_type(X)=section          0                     39
logic_type(X)=section_title    0                     1



As shown, no omission errors occur, while some commission errors are found,
especially for section components. This is due to the heterogeneous layout of generic
section components, which produces a heterogeneous set of examples and hence a set of
rules that is quite specific.

Using the layout-based labelling, WebClassII is asked to classify the text of a
generic section as introduction, materials and methods, results, or discussion,
and also as abstract, even though layout-based classification is generally sufficient
to classify abstract components correctly on geometrical and spatial criteria. Therefore,
the data source used in this experimental study is an ontology composed of five
categories (i.e. Abstract, Introduction, Materials & Methods, Discussion and Results;
see Fig. 2). These categories correspond to the five recurrent types of sections
according to which biomedical corpora are generally structured. For each category,
there are 19 sections, one for each of the 19 documents (e.g. the Abstract category
contains 19 abstract sections, the Introduction category contains 19 introductions, and
so on). The documents have been partitioned into five subsets, four of which compose
the training set while the fifth subset represents the testing set.




Fig. 2. Documents Ontology. 



Classification accuracy has been evaluated by means of precision and recall. The
precision for a category c measures the percentage of correct assignments among all
the documents assigned to c, while the recall gives the percentage of correct
assignments in c among all the documents that should be assigned to c. The
misclassification error (ncErr) has also been computed: it is the percentage of
documents in c misclassified into a category c’ not related to c in the hierarchy.
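
As a concrete reading of these two definitions, precision and recall for one category
can be computed as in the sketch below. The function name and the dictionary inputs are
illustrative assumptions; values are returned as fractions rather than percentages, and
the misclassification error is not modelled because it depends on the hierarchy.

```python
def precision_recall(assigned, expected, category):
    """assigned / expected: {doc_id: category}.
    Returns (precision, recall) for the given category as fractions in [0, 1]."""
    assigned_to_c = {d for d, c in assigned.items() if c == category}
    belongs_to_c = {d for d, c in expected.items() if c == category}
    correct = len(assigned_to_c & belongs_to_c)
    precision = correct / len(assigned_to_c) if assigned_to_c else 0.0
    recall = correct / len(belongs_to_c) if belongs_to_c else 0.0
    return precision, recall
```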

Results of the classification process are reported in Fig. 3. The graphs show that
precision and recall increase as the number of features varies in the range from
15 to 30. The system reaches the best precision values for the Materials & Methods
and Results classes. By contrast, Abstract, Introduction and Discussion sections
are much more difficult to classify, because these sections are generally devoted
to the general topic of the document.




Fig. 3. Experimental results. 



6 Conclusions and Future Work 

This work presents a first step towards a document management framework that
can take advantage of different types of knowledge (layout and textual features at
different levels of abstraction) to solve the problem of the mismatch between user
requests (based on high-level abstract concepts) and the way in which these requests
are satisfied (low-level features for indexing) during an information retrieval process.
In a further extension of our DIA framework, we plan to explore the application
of information extraction techniques in order to extract useful information from the
text. In particular, there are now several methods for extracting biological
information from scientific articles, and domain experts generally recognize
without ambiguity in which part of a paper the relevant data are located [8]. A
formalization of the domain expert knowledge promises to lead to a new generation of
document management systems that could learn about the distribution of semantics in
papers and build indexing models based on truly relevant keywords.

Moreover, we plan to conduct more extensive evaluations of the “semantic”
classifier. We are also interested in investigating the application of machine learning
techniques to reading order detection, as a further knowledge acquisition step in the
workflow from logical to semantic structure extraction.

Abstract. This paper presents a semi-supervised document image classification
system intended to be integrated into commercial document reading software.

The system is designed as an annotation aid. From a set of unknown
document images given by a human operator, the system computes hypotheses for
regrouping images that share the same physical layout and proposes them
to the operator. The operator can then correct or validate them, keeping in
mind that the objective is to obtain homogeneous groups of images. These
groups will be used for training the supervised document image
classifier. Our system contains N feature spaces and a metric function
for each of them, which makes it possible to compute the similarity between two
points of the same space. After projecting each image into these N feature
spaces, the system builds N hierarchical agglomerative classification
(HAC) trees, one per feature space. The regrouping proposals
formulated by the various HAC trees are compared and merged.
Results, evaluated by the number of corrections made by the operator,
are presented on different image sets.



1 Introduction 

Recent improvements in pattern recognition and document analysis have led to the
emergence of applications that automate document processing. From a scanned
document, some software packages are able to read its handwritten or machine-printed
content or to identify symbols or logos. Others can retrieve the category
(later on called “class”) to which it belongs. However, a training step is necessary,
during which a human operator provides image samples with the same layout for each
class. Generally these images are representative of the stream to sort.

For example, sorting incoming mail in companies makes it possible to redirect an
unknown document to the right department or to apply appropriate processing
depending on its document class [1] [2]. However, such software is not yet able
to extract all the information in the image, and a human operator has to
define the tasks that the software must accomplish depending on the
document class of the image.

The proposed approach improves the functionality of an existing software product
(A2iA FieldReader). Initially, this application was able to read handwritten
and machine-printed fields on documents coming from a homogeneous
stream of documents, all of which shared the same reading model. Then
a supervised document classifier was added, making it possible to process documents
from several classes: after a training step, the system was able to find the class of
an unknown document. The reading model of each class, containing the position,
type and semantics of the fields to read, drives the reading module. The supervised
classifier must automatically find the most discriminating feature set for any
set of images and any number of classes, because the users are not specialists
in image analysis. Another difficulty is that a human operator has to provide a few
sample document images per class to constitute a training database for the
supervised classifier. This task quickly becomes difficult or even impossible if the
database is composed of images coming from tens of different document classes,
all the more so if images of different classes have only small layout differences. As a
result, the training databases usually contain only a few samples. The classification
method presented in [3] proposes a solution for these constraints.

In this article, we propose a semi-supervised classification system inspired
by Muslea et al. [4] that aims to assist annotation without a priori knowledge
of the number of classes and their characteristics. In this setting, it is difficult to
know which features are discriminating for a given set of images and classes. From
a set of document images drawn from a heterogeneous stream, the system proposes to
the operator some groups of images with the same layout. Through a GUI, the
operator can validate or correct these proposals. A few kinds of correction are allowed:
semi-automatic merging or splitting of groups, and adding or removing documents
in the proposed groups.

In section 2, we briefly present a few methods of unsupervised classification
and justify our choice of the hierarchical agglomerative classification method. We
describe our semi-supervised algorithm in section 3. Results on five different
image databases are then presented in section 4. Conclusions and future improvements
are discussed in section 5.

2 Unsupervised Classification Algorithms 

A state of the art of unsupervised classification can be found in [5], [6] and [7].
We recall the main methods here.

The K-means algorithm seeks a partition of a set E into k groups of
elements that are well aggregated and well separated in the feature space, but our
system must work without knowing the expected number of classes because in
most cases even the operator does not know it.

Self-organising maps are based on a neural network with neighbourhood
constraints. They do not need the expected number of classes to be known,
but a large number of samples is necessary to make them converge. We can also
note that this convergence is not guaranteed for feature vectors with a size
greater than one [8].

Hierarchical Agglomerative Classification (HAC) is an algorithm that produces
a hierarchy of sets of the considered data, and has the advantage of proposing
a data structure without knowing the number of expected classes. The result
is a tree in which each node represents a group and the root contains all the
elements. Various existing criteria make it possible to cut some edges of the tree and
to form groups from the elements contained in the descendant nodes [8].

Among these three classical methods of unsupervised classification, HAC
seems to be the most suitable for our problem. Indeed, the drawback
of its computational complexity is offset because only a few samples are
used. As HAC trees only need the definition of a distance, they may be built with
numerical, syntactic or structural data. On the other hand, SOMs would lack the
samples needed to guarantee convergence, and the K-means method needs the number
of expected classes. However, all of them work from numerical data extracted
from the images of the training set. These images, often noisy, will introduce
variability in the features. To correct these errors, the introduction of a semantic
level would be appropriate, such as extracting well-identified graphical objects
(boxes, titles, combs, etc.). We rule out this solution because it introduces a bias:
it would lead to developing a large database of competing extractors. Our idea is to
have a few feature spaces into which we project the images, and to build a HAC
tree for each space. Using one large feature vector, obtained by concatenating
several vectors, would bring us back to the problem just mentioned. So we obtain as
many HAC trees as feature spaces. These features are of different kinds: visual
(seeking white or black zones of the image, average grey value, etc.), structural (line
counting, rectangle extraction, etc.) and statistical (differential of connected
component sizes, etc.). Each HAC tree yields regrouping hypotheses, which are all
exploited in parallel to finally determine the groups to be submitted to the operator.

3 Multi-view HAC Algorithm

3.1 A Few Definitions

Let ImageSet be the training set. Let FeaturesSet be the set of available feature
spaces. For any feature space E, a function F_E that projects an image into E
is defined by:

$$\forall E \in FeaturesSet, \quad F_E : ImageSet \to E$$

For any feature space E, a function M_E that computes the distance between two
points of E is defined by:

$$\forall E \in FeaturesSet, \quad M_E : E \times E \to \mathbb{R}^{+}$$

For any feature space E, a function D_E that computes the distance between two
images of ImageSet is defined by:

$$\forall E \in FeaturesSet, \quad D_E : ImageSet \times ImageSet \to \mathbb{R}^{+}$$

$$\forall E \in FeaturesSet, \ \forall (I_1, I_2) \in ImageSet^2, \quad D_E(I_1, I_2) = M_E(F_E(I_1), F_E(I_2))$$

In short, F_E projects an image into the feature space E and M_E computes the
distance between two points of E. To simplify the notation, we write D_E for the
function that computes the distance between two images in the feature space E.

3.2 Building a HAC Tree 

The algorithm for building a HAC tree for a given feature space E is as follows:

1. Initialize a list L with one group per image of ImageSet

2. Compute the distance between all pairs of images of ImageSet

3. Merge the two closest groups A and B into a group G

4. Remove A and B from L and add G to L

5. Compute the distance between G and all the other groups of L

6. If L contains more than one group, go back to step 3

This algorithm requires two distances. The first one is the distance between
two images (step 2): it is the distance D_E defined in 3.1.
The other one is the distance between two groups of images G and G' (step
5). It can be defined as:

- The diameter of the set G ∪ G'. This choice gives a measure of the
variability of the group G ∪ G': $\max_{I \in G,\ I' \in G'} D_E(I, I')$

- The minimal distance between points of each group: $\min_{I \in G,\ I' \in G'} D_E(I, I')$

- The average distance between the points of the union of the two groups:

$$\frac{\sum_{I, J \in G \cup G',\ I \neq J} D_E(I, J)}{\|G \cup G'\|}$$

When a group G is created (step 3) from the two closest groups A and B,
the distance between A and B is also the height of G. That is why
this tree structure is often represented by a dendrogram.

The algorithm stops when only one group remains. This group is called
the root of the HAC tree and contains all the images.

So, our algorithm builds as many HAC trees as there are feature spaces available
in the system.
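
The tree-building loop above can be sketched as follows. This is a deliberately naive
illustration under assumptions: `dist` stands for the image distance of section 3.1,
group distances are recomputed at every iteration instead of being cached as in steps 2
and 5, and only the max and min linkages are covered. The function name and the
representation of a node as a frozenset of image identifiers are ours.

```python
from itertools import combinations


def build_hac(images, dist, linkage=max):
    """images: hashable image ids; dist(a, b): distance between two images;
    linkage: max (diameter-style) or min.
    Returns the merged groups as (frozenset_of_images, height) in merge order;
    the last entry is the root."""
    groups = [frozenset([i]) for i in images]        # step 1: one group per image
    merges = []

    def group_dist(a, b):
        return linkage(dist(x, y) for x in a for y in b)

    while len(groups) > 1:                           # step 6
        a, b = min(combinations(groups, 2), key=lambda p: group_dist(*p))  # step 3
        merged = a | b
        merges.append((merged, group_dist(a, b)))    # height = distance between A and B
        groups = [g for g in groups if g not in (a, b)] + [merged]         # step 4
    return merges
```

Running the function once per feature space, each time with that space's own distance,
would give one tree per space.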



3.3 Extraction of Grouping Hypotheses Common to Different HAC Trees

The system now has several HAC trees that represent different structures of the
same data. For any pair of HAC trees, we extract every group (node) that contains
the same images in both trees. These groups can be considered as regrouping
hypotheses shared by different points of view. We denote by Select the set of
nodes appearing in at least two HAC trees. The system now has a set of groups
shared by several HAC trees, which are a priori the most reliable groups.
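
The construction of Select can be sketched as follows, representing each node by the
frozenset of images it contains. The function name is illustrative. As an explicit
assumption not stated in the text, single-image leaves and the root are discarded,
since every tree shares them by construction.

```python
from itertools import combinations


def select_common_groups(trees, all_images):
    """trees: list of HAC trees, each a collection of nodes (frozensets of images).
    Returns the non-trivial nodes that appear in at least two trees."""
    node_sets = [set(tree) for tree in trees]
    select = set()
    for t1, t2 in combinations(node_sets, 2):
        select |= t1 & t2                  # nodes with identical image content
    # Assumption: leaves and the root are shared by construction, so drop them.
    full = frozenset(all_images)
    return {g for g in select if len(g) > 1 and g != full}
```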

3.4 Building the Minimal Inclusion Forest

The system establishes hierarchical links between the nodes of the Select list as
follows: the father of a given node N is the smallest node containing N. The
result is a forest F (a set of trees).
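
The father relation can be sketched as follows, again with nodes represented as
frozensets of images. The function name is illustrative, and when two candidate
fathers have the same size the sketch simply keeps the first one found, a tie-breaking
choice that the text does not specify.

```python
def inclusion_forest(select):
    """select: iterable of frozensets (the Select nodes).
    Returns {node: father_or_None}; nodes whose father is None are roots of the forest."""
    nodes = list(select)
    father = {}
    for n in nodes:
        supersets = [m for m in nodes if n < m]      # strict inclusion
        father[n] = min(supersets, key=len) if supersets else None
    return father
```

The number of trees in the forest is the number of nodes whose father is `None`.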

Figures 1 and 2 present two inclusion forests. Each group (node) contains
its image list with the following syntax: [C]_[N], with C the identifier of the
class and N the identifier of the image inside the class. The coloured nodes are
homogeneous (images of the same class) and the nodes with a white background
contain images from different classes.




Fig. 1. Forest of DB1. HAC built with the Min distance.



Fig. 2. Forest of DB4. HAC built with the Max distance.



The forest of Fig. 1 contains two trees, neither of which is homogeneous. For each
class, we can find a node containing all the images of the class.

The forest of Fig. 2 contains thirteen trees, only one of which is not homogeneous:
it contains images of two different classes. One class has produced two roots
with no link between them (class 11, left of the image).

3.5 Presenting the Forest to the Human Operator 

For each tree of the forest, the images contained in a group are presented to the
operator as an array of thumbnails. Faced with a group G, the operator can:

- Validate G if the images are from the same class. The group is then ready for a
possible merge with another group.

- Reject G if it contains images with different layout structures. In that case, the
system removes G and presents the groups of the descendant nodes of G to
the operator. Experimentally, this case is frequent: because the groups are
built with numerical heuristics, the probability that a group
is homogeneous decreases as its size increases.

- Merge G with another group G' if the images of G and G' belong to the same
over-segmented class. Both groups must have been validated beforehand. The
system replaces G and G' with a group G'', the union of the images contained in G
and G'. This happens when only some of the images of a class share the same
defect. For example, a black logo may or may not be whitened by an adaptive
binarization, and it seems natural that the algorithm separates the images into two
sub-classes according to whether the logo is whitened.

4 Results 

The results for five image sets and for one database regrouping four sets are
presented in Table 1. Sample images are presented in Figs. 3 to 7.
Each image class is composed of a random number of images (from 3 to 10)
randomly drawn from a database containing thousands of images.



Table 1. Operator cost to get homogeneous groups. 





                            DB 0   DB 1   DB 2   DB 3   DB 4   DB 1,2,3,4

#Images (total)             15     33     31     31     70     165
#Classes                    2      8      6      6      15     35
Rejects                     0      3      5      5      1      6
Merges (after validation)   0      0      1      0      1      3
Classified Images           100%   100%   100%   100%   81%    99%
Well Formed Classes         100%   100%   100%   100%   93%    100%



We call “Classified Images” the proportion of images inside the inclusion forest. For
example, for database 4, “81% of the 70 images have been classified” means that
13 images are not inside the inclusion forest. Experimentally, we have noticed
that these images have significant defects compared to the other images of their
class.

We call “Well Formed Classes” the proportion of classes found by the classifier.
For example, for database 4, “93% of the 15 classes have been well formed”
means that one class has not been retrieved.

Once the operator has made the corrections, the system learns the
images of the validated groups. It reminds the operator that some images
have not been learned because they were not included in any of the trees, and
tries to classify them with the approval of the operator. The operator
can then finish the configuration of the learning classes.

Fig. 3. Image Samples of DB0.

Fig. 4. Image Samples for the 8 classes of DB1.

Fig. 5. Image Samples for the 6 classes of DB2.

Fig. 6. Image Samples for the 6 classes of DB3.



5 Conclusion 

This article presented an effective technique for semi-supervised classification. We
introduced a multi-view notion with different feature spaces to counter
the blindness due to the purely numerical considerations induced by the HAC trees.
On the other hand, the HAC trees free us from the problem of the shape of the
clusters in the different feature spaces and of their number, information that even
the operator does not know. Since the performance criterion ultimately depends on the
number of corrections the operator has to make to obtain homogeneous classes, we
have to consider carefully how to present the results of our algorithm to the
operator.

As in most unsupervised classification systems, once the distances between the
images have been computed, we no longer exploit the images, even though they
are shown to the operator. It could therefore be judicious to design an algorithm
that automatically extracts a set of graphical objects as well as their
neighbourhood relationships. The system would justify the presentation of a
group to the operator by the presence of these objects and by the validity
of their neighbourhood relationships on all images of the group. Graphical objects
could be extracted without a priori knowledge, so as not to return to the problems
mentioned in section 2, but with simple and strict geometric criteria in order to limit
errors.

Moreover, it would be interesting to try to cut the trees of the forest in order to
present homogeneous image groups directly to the operator. However, tests carried out
on these databases with various cut criteria all proved more
expensive for the operator than presenting the inclusion forest. This suggests
that it is illusory to try to form homogeneous groups automatically. Indeed,
let us recall that this system helps the operator to quickly form classes of images
for the training of the document sorting software. Thus, if automation introduces
errors into the learned classes, the consequences for the effectiveness
of the document classification system are serious.

Fig. 7. Image Samples for the 15 classes of DB4.

Abstract. This paper presents our work on locating and removing unwanted text
stamps within archive documents which are being prepared for OCR. Text stamps 
mainly comprise one or several text lines with a fixed shape, font size and colour, 
and may appear anywhere on the document with variable orientation and overlap 
of other text fields. We apply a configurable user interface to register features 
of a sample stamp (such as corners, font-size and print colour) as a template 
using fuzzy rules, and then analyse each document image to find matching stamps 
using fuzzy functions as a classification mechanism. The configurable interface 
allows the user to decide which and how many features should be used to describe 
the target stamp. Evaluation was very encouraging. We tested 1,241 specimen 
index cards from a biological archive card index, and achieved 92-95% correct 
detection rate and 85-95% complete removal rate. 



1 Introduction 

Identification of fixed symbols (e.g. seal imprints, logos and signatures) attracts a lot of 
attention in the field of document image analysis. [1] [2] [3] [4] present several methods 
of identifying seal imprints, while [5] [6] [7] [8] present approaches for verifying sig- 
natures. [9] and [10] provide different ways of recognizing logos. Most methods follow 
three steps: template registration, detection and classification. 

- Template registration records modeled features of the target symbol. Feature mod- 
eling varies widely. [1] [2] [3] [4] [5] [6] [7] [9] [10] calculate a set of constraints 
for features by using formulae. [11] uses a neural network to train a set of weights 
for features. [8] uses fuzzy sets to model features. 

- Detection aims to distinguish the potential target symbol from other content of the 
document. In most document images, symbols are mixed with other components 
e.g. text. Therefore, some measures have to be taken to narrow the target. [4] sug- 
gests a method using color information to separate signatures and seal imprints 
from bank check backgrounds. [10] suggests using the size of the symbol block to 
reduce the potential candidates. 

- Finally, classification is the matching process, which compares the candidate with 
the template. 

We are currently analysing archive documents, such as specimen index cards and 
library index cards. These frequently contain stamps, consisting of several text lines with 
specific shape, font size and color. These stamps are not useful elements of the archive 
documents, and should be removed to reduce errors in extracting useful textual 
information. In this paper, we suggest a method to identify stamps by using fuzzy logic [12] 
[13]. We use fuzzy terms to model features of corners of the stamp with a configurable 
user interface, and then use the method suggested in [10] to narrow down the target 
document areas. Finally, we match the candidate with the template by using a fuzzy 
inference mechanism to overcome variations in printing of the stamp. 

Our system has several advantages. It is flexible and adaptable for different ap- 
plications and datasets, as the configuration and features of each stamp to be detected 
are defined visually by the user, and then mapped to fuzzy rules, rather than hard- 
programmed. By using fuzzy logic, parameterised via the user interface, the system can 
handle stamps and document sets with widely differing formats and characteristics re- 
liably. 

The paper is organised as follows: Section 2 starts by reviewing our overall system, 
and then outlines the features we extract to help identify and locate the stamps. Section 
3 describes our segmentation method. Section 4 describes how configurable templates 
are constructed by users of the system, and how template attributes are mapped to fuzzy 
logic parameters. Section 5 explains how the fuzzy logic parameters are matched against 
features extracted from test archive images, enabling the presence of a stamp to be de- 
tected and the stamp removed. Section 6 describes how the system was evaluated, and 
presents the results obtained. Section 7 draws conclusions and discusses further work. 

2 System Overview and Stamps Feature Analysis 

2.1 System Overview 

Index cards have a consistent physical format (typically 5" x 3" or 6" x 4" thin card) 
but a wide range of handwritten, typed, printed and mixed text content. Fig. 1 shows 
a variety of formats found within the index cards at the UK Natural History Museum 
(NHM), which range from highly-structured to relatively free-format text. More than 
500,000 cards have now been digitized using a scanner system first discussed in [14], 
and are progressively being made accessible online [15]. OCR of the content of these 
cards varies from difficult (where high quality typed or printed text is used, but with 
specialized scientific lexicons and non-standard abbreviations) to extremely challenging 
(similar lexicons but handwritten and amended text). Although the overall variation over 
a complete dataset (hundreds of thousands of cards) is large, substantial batches of cards 
within the dataset are often similarly constructed, making it possible for a taxonomist 
to improve OCR by defining batch-specific card formats using a few user-defined tem- 
plates. Processing then involves anything from recognizing a batch containing a single 
format of card (and single corresponding card template), through to processing a mixed 
batch of cards with several corresponding templates. Our system is therefore designed 
to support the process of user-assisted template configuration for card index document 
analysis and recognition. An important sub-component of this system identifies and re- 
moves validation stamps, which are found on many index cards but do not contribute 
to the useful textual content of the card. The stamp identification and removal system 
comprises four parts: segmentation, template registration, identification and removal. 

Fig. 1. Example card formats from the Natural History Museum archive card index, showing four 
commonly encountered validation stamps. 

2.2 Feature Extraction 

When humans try to identify a document pattern, the global features first attract their 
attention and then the local features. For archive validation stamps, the global features 
are generally colour and structure, and the local features are the textual content. In this 
research, we identify stamp patterns on archive cards by feature extraction of global 
colour and structure. 

2.3 Stamp Colour Analysis 

Validation stamps on archive cards are produced by pressing the stamp onto an inkpad 
and then onto the card. Therefore stamp images are normally a single colour (black, 
blue, red or green) according to the colour of the inkpad. If the inkpad colour differs 
from the predominant text colour of the archive card, then simple colour-based seg- 
mentation of the elements of the card may be sufficient to identify the stamp uniquely, 
enabling it to be removed from the remainder of the card content. Commonly however, 
both the stamp and the typed or handwritten card text use the same colour ink, making 
structural analysis necessary to identify the stamp pattern, even when colour segmenta- 
tion has been applied to simplify structural analysis. We discuss colour map detection 
for archive documents in a separate paper submitted to DAS2004 [16]. 

2.4 Stamp Structure Analysis 

Four different validation stamp styles have been found in our archive documents, as 
shown in Fig. 2. 

Fig. 2. NHM validation stamps (stamps (a) and (d) are common, stamps (b) and (c) rare, so we 
evaluated our system only using (a) and (d)). 

We concluded that the global structure of these stamp patterns has two elements: 
inner structure and outermost boundary. The inner structure depicts the inner skeleton 
of the patterns, which is represented by horizontal white spaces between text lines and 
vertical white spaces between words. The outermost boundary reflects the patterns’ shape 
and size, where the shape depicts the outermost edges while the size indicates their 
dimensions. However, by investigating many archive documents, we found that the inner 
structure is not reliable, as it often suffers from “inner damage” (due to the mechanism 
of pressing the stamp onto the card by hand) as shown in Fig. 3. 

Fig. 3. Examples of inner structure damage to NHM validation stamps. 



Therefore we concentrated our efforts on utilising features from the outermost 
boundary, because it is more reliable. The stamp patterns can hence be represented 
by a polygon contour whose corner points correspond to the ends of text lines. The 
lengths (e.g. D1_2) and relative orientations (represented by internal corner angles, e.g. 
A3_1_2) of the sides of the enclosing polygon are sufficient to define the polygon and 
thus the outer boundary shape and size of the stamp. Fig. 4 shows the boundary contour 
and identified features for the stamp pattern shown in Fig. 2(a). A single feature, skew 
angle, is sufficient to define the absolute rotational orientation of the stamp, which, 
because of its mechanism of formation, is often not printed squarely on the archive card (in 
rare cases, it is even printed upside down). Finally, the variation in the fonts, font sizes 
and capitalization of the characters within each validation stamp can be represented by 
a “font height” feature (e.g. F1) recorded for each corner of the stamp. Here, we use 
the height of the character at each corner rather than its width, because a font 
width feature would require accurate horizontal segmentation of characters, whereas 
font height only requires segmentation from non-overlapping adjacent lines. 




3 Stamp Structure Analysis 

Each text string on an archive card is considered as a fundamental element, which con- 
tains structural features. The purpose of stamp structure analysis is to segment all the 
text strings from the stamp color plane obtained previously, and then extract corner 
features from each text string. First of all, binarization is applied to the stamp colour 
plane (Fig. 5(a)). Then binary connected component bounding boxes are used to group 
the foreground pixels of the image into features. No filtering or smearing is applied be- 
fore binary connected component analysis, hence the bounding boxes obtained will cor- 
respond to individual characters or groups of touching characters, as shown in Fig. 5(b). 
To avoid black pepper noise, we then apply a filter to remove bounding boxes smaller 
than 3 pixels square. The filter is adjustable according to the application through the 
interface. Adjacent bounding boxes are then accumulated to generate short text strings, 
representing words or phrases. Two situations can arise in forming the text strings. The 
simplest case is when the stamp colour plane contains only strings corresponding to 
the stamp image, because all other text strings are contained in a different colour plane. 
Commonly however, text strings may or may not belong to the stamp, because the stamp 
printing is the same colour as other text in the archive image. In this latter case, all the 
short text strings, and all their corresponding potential corners, need to be identified, 
and the positional relationships between text corners are then used to identify and 
eliminate string configurations which cannot represent stamps. 

A short text string is accumulated from left to right. If the horizontal distance 
between two bounding boxes is less than twice the initial character width of that string 
and they overlap vertically by more than half the initial character height, we consider 
that these two bounding boxes are adjacent to each other. The initial values for character 
width and height are successively replaced by an average of those of previously joined 
bounding boxes, allowing the assumed character size to be progressively updated along 
a text string. The process is terminated when no further adjacent bounding box is 
found. Then a new short string is initialized from the nearest bounding box. Repeating 
the whole process left-to-right and top-to-bottom, we finally obtain a set of short strings. 
The short strings are then further joined into longer strings by using a Hough Transform 
[17] [18]. The ends of the long strings are the identified “corners” of text lines, which 
are the primary feature used to identify instances of stamp templates occurring in each 
archive card image. 

Fig. 5. Examples of binarized stamp colour plane (a) and its connected components (b). 
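
The accumulation rule for short strings can be illustrated with a minimal sketch. It assumes axis-aligned bounding boxes in pixel coordinates; the names `Box`, `grow_string` and `short_strings` are hypothetical and not taken from the paper, seeding each new string from the leftmost remaining box is a simplification of the "nearest bounding box" rule, and the Hough Transform step that joins short strings into long ones is not shown.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box of one connected component (hypothetical type)."""
    x0: int  # left
    y0: int  # top
    x1: int  # right
    y1: int  # bottom

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.y1 - self.y0


def vertical_overlap(a, b):
    """Number of pixel rows shared by two boxes (0 if they do not overlap)."""
    return max(0, min(a.y1, b.y1) - max(a.y0, b.y0))


def grow_string(seed, pool):
    """Grow one short string rightwards from `seed`, consuming boxes from `pool`."""
    string = [seed]
    avg_w, avg_h = float(seed.width), float(seed.height)
    while True:
        last = string[-1]
        # Adjacent: horizontal gap below twice the running character width and
        # vertical overlap above half the running character height.
        adjacent = [b for b in pool
                    if b.x0 > last.x0
                    and b.x0 - last.x1 < 2 * avg_w
                    and vertical_overlap(last, b) > 0.5 * avg_h]
        if not adjacent:
            return string
        nxt = min(adjacent, key=lambda b: b.x0)
        pool.remove(nxt)
        string.append(nxt)
        # The assumed character size is updated from the boxes joined so far.
        avg_w = sum(b.width for b in string) / len(string)
        avg_h = sum(b.height for b in string) / len(string)


def short_strings(boxes):
    """Partition boxes into short strings, seeding from the leftmost remaining box."""
    pool = sorted(boxes, key=lambda b: (b.x0, b.y0))
    strings = []
    while pool:
        seed = pool.pop(0)
        strings.append(grow_string(seed, pool))
    return strings


boxes = [Box(0, 0, 10, 12), Box(12, 0, 22, 12), Box(24, 1, 34, 12),
         Box(100, 0, 110, 12), Box(0, 30, 10, 42)]
groups = short_strings(boxes)
```

In this toy input the first three boxes form one string, while the distant box on the same line and the box on the second line each start their own string.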



4 Stamp Template Registration 

In this section, we indicate how the template for a stamp is constructed by using fuzzy 
logic to represent its structural features. First of all, the user selects a sample card image 
containing a representative sample of the stamp pattern using the system’s graphical 
user interface [19]. This stamp image is then cropped and converted into a binary image. 
Then, using the segmentation method described in Section 3 to obtain each text line 
of the stamp, the “corners” of the stamp are located. From the structure analysis in 
Section 2.4, three structural features, angle, distance and font, are identified for each 
“corner”. 

4.1 Structural Features Registration 

For feature angle, each inward angle of a stamp can be measured by drawing a closed 
polygon graph connecting all the corners of the template stamp. The average angle, 
Aaver = 180 × (n−2)/n degrees, depends on n, the number of sides the stamp contour contains. The 
maximum Amax, minimum Amin and average corner angle Aaver are then used to scale 
the allowed fuzzy corner angle ranges of the stamp. Based on these values, a membership 
function for feature angle is defined using three ranges, “SMALL”, “MEDIUM” 
and “LARGE”. The peaks of the ranges correspond to Amin, Aaver and Amax respectively, 
and any angle whose value is outside the range [Amin, Amax] is considered as 
“NULL”. In the case of stamp Fig. 2(a), the membership function is defined as shown 
in Fig. 6(a), where Amax=180, Amin=60 and Aaver=120. Similarly, the membership 
functions for features distance and font are also defined with three ranges, “SHORT”, 
“MEDIUM”, “LONG” and “SHORT”, “MEDIUM”, “TALL” respectively. In the case of stamp 
Fig. 2(a), the membership functions for these features are shown in Figures 6(b) and 
(c) respectively. 

Fig. 6. Examples of membership function for feature angle (a), distance (b) and font (c). 

Using the membership functions, the measured values of angle, distance and font for 
any configuration of corners corresponding to a possible stamp can be mapped into their 
corresponding fuzzy representation. The template fuzzy expression for stamp Fig. 2(a) 
is, for example, represented in Table 1. 

Table 1. Example of template expression for stamp structural features angle, distance and font. 

Features          C1      C2      C3      C4      C5      C6 
Angle             A3_1_2  A1_2_3  A5_3_1  A2_4_6  A6_5_3  A4_6_5 
Fuzzy expression  MEDIUM  MEDIUM  SMALL   SMALL   LARGE   LARGE 
Distance          D1_2    D2_4    D4_6    D6_5    D5_3    D3_1 
Fuzzy expression  LONG    SHORT   MEDIUM  MEDIUM  MEDIUM  SHORT 
Font              F1      F2      F3      F4      F5      F6 
Fuzzy expression  MEDIUM  MEDIUM  MEDIUM  MEDIUM  MEDIUM  MEDIUM 
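
As an illustration of how a measured corner angle might be mapped to a fuzzy term, the following sketch assumes triangular membership functions peaking at Amin, Aaver and Amax. The exact shapes used by the authors are given only in Fig. 6, so the triangular form, and the function names `average_interior_angle` and `fuzzify_angle`, are assumptions made for this example.

```python
def average_interior_angle(n_sides):
    """Average interior angle, in degrees, of a closed polygon with n_sides sides."""
    return 180.0 * (n_sides - 2) / n_sides


def fuzzify_angle(angle, a_min, a_aver, a_max):
    """Map a measured angle to SMALL / MEDIUM / LARGE, or NULL outside [a_min, a_max].

    Assumes a_min < a_aver < a_max and triangular memberships (an assumption,
    not the paper's stated shape): SMALL peaks at a_min, MEDIUM at a_aver,
    LARGE at a_max.
    """
    if not a_min <= angle <= a_max:
        return "NULL"
    if angle <= a_aver:
        t = (angle - a_min) / (a_aver - a_min)  # 0 at a_min, 1 at a_aver
        degrees = {"SMALL": 1.0 - t, "MEDIUM": t, "LARGE": 0.0}
    else:
        t = (angle - a_aver) / (a_max - a_aver)  # 0 at a_aver, 1 at a_max
        degrees = {"SMALL": 0.0, "MEDIUM": 1.0 - t, "LARGE": t}
    # Return the term with the highest membership degree.
    return max(degrees, key=degrees.get)


# Values quoted for the six-cornered stamp of Fig. 2(a).
A_MIN, A_AVER, A_MAX = 60.0, average_interior_angle(6), 180.0
labels = [fuzzify_angle(a, A_MIN, A_AVER, A_MAX) for a in (65, 118, 170, 40)]
```

The same construction would apply to distance and font height with their own minimum, average and maximum values.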



4.2 Registration of Other Important Stamp Parameters 

Besides the structural features, there are some other important global stamp parameters: 
the overall size of a stamp SZt(x,y) (in pixels) and the maximum vertical 
WHPy and horizontal WHPx inner white space (in pixels). These parameters are used 
for stamp isolation, as described in the next section. 

Fig. 8. Example of stamp. 



4.3 Template Configuration 

All the corner features (angle, distance and font) and global parameters are visually 
identified and represented on the graphical user interface (Fig. 7). Curators can therefore 
easily understand both the graphical significance and linguistic representation of each 
feature. This allows curators to change the attributes of each feature and parameter so 
that the stamp removal system can readily be reconfigured for use with other stamps 
having different formats and attributes. 



5 Identification and Removal 

5.1 Stamp Isolation 

Through colour segmentation, the stamp colour plane can be obtained. The colour plane 
is then binarized. In order to improve analysis accuracy and reduce complexity, we use 
the method suggested by [10], in which a binary connected component bounding box 
with smear values (WHPy and WHPx) is used to locate potential stamp blocks. In our 
application, if the size of the located block SZc(x,y) is similar to or bigger than that 
of the registered SZt(x,y), it is considered as a candidate stamp block. We then carry 
out stamp structure analysis on these isolated candidate stamp blocks to extract all the 
“corners” they contain, as described previously in Section 3. 

5.2 Matching 

The extracted “corners” are grouped in the same format as has been registered for 
each template. Features of each group can be extracted, and then converted into their 
corresponding fuzzy representation by using the previously defined membership 
functions (e.g. Fig. 6(a), (b) and (c)). For instance, a group of “corners”, which is extracted from 
Fig. 8, has a set of structural features as shown in Table 2. We construct a fuzzy 
inference mechanism to map these instance features to those registered for each template by 
deriving a set of fuzzy “IF-THEN” rules. 

The inference mechanism provides a weighting system to weight each of the features. 
For instance, for the feature angle, if an angle is classified as “SMALL” and its 
counterpart in the template is “LARGE”, it weighs 0%; if classified as “MEDIUM”, it 
weighs 50%; if classified as “LARGE”, it weighs 100%; if classified as “NULL”, it also 
weighs 0%. Table 3 shows the full weighting table for feature angle. Other features can 
also be weighted in the same way, therefore the example of Table 2 will be weighted as 
shown in Table 4. The weight for each “corner” Ci(A,D,F) is the average of the weights 
of its angle A(i), distance D(i) and font F(i). The overall weight for the group 
of “corners” is the average of the weights of C1(A,D,F), ..., Ci(A,D,F). 

Table 2. Instance of fuzzy expression for a group of corners with structural features angle, 
distance and font. 

Features  C1      C2      C3      C4      C5      C6 
Angle     LARGE   MEDIUM  NULL    SMALL   LARGE   LARGE 
Distance  LONG    SHORT   NULL    MEDIUM  NULL    SHORT 
Font      MEDIUM  MEDIUM  MEDIUM  MEDIUM  MEDIUM  MEDIUM 

Table 3. Weighting table for feature angle (rows: instance classification; columns: template). 

Angle   SMALL  MEDIUM  LARGE 
SMALL   100%   50%     0% 
MEDIUM  50%    100%    50% 
LARGE   0%     50%     100% 
NULL    0%     0%      0% 



Table 4. Example of weighting for a group of corners with structural features angle, distance and font. 

                C1      C2      C3      C4      C5      C6 
Angle           50%     100%    0%      100%    100%    100% 
Distance        100%    100%    0%      100%    0%      100% 
Font            100%    100%    100%    100%    100%    100% 
Weight          83.3%   100%    33.3%   100%    66.7%   100% 
Overall weight  483.3%/6 = 80.55% 



If any group of corners is classified as “HIGH” (>75%) and its overall weight is the 
highest of all candidate stamps, this group of “corners” is considered to be the “corners” 
of the stamp. The stamp is therefore detected. 
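
The weighting procedure can be checked with a small sketch that recomputes the worked example from the template of Table 1 and the instance of Table 2. It assumes, as the text suggests but does not spell out, that distance and font use the same weighting scheme as angle (100% for an exact match, 50% for a neighbouring term, 0% otherwise or for NULL); the function and variable names are hypothetical.

```python
ANGLE_TERMS = ["SMALL", "MEDIUM", "LARGE"]
DISTANCE_TERMS = ["SHORT", "MEDIUM", "LONG"]
FONT_TERMS = ["SHORT", "MEDIUM", "TALL"]


def term_weight(instance, template, terms):
    """Weight (in %) of an instance term against its template counterpart."""
    if instance == "NULL":
        return 0.0
    step = abs(terms.index(instance) - terms.index(template))
    return {0: 100.0, 1: 50.0}.get(step, 0.0)


# Template (Table 1) and instance (Table 2), one entry per corner C1..C6.
template = {
    "angle": ["MEDIUM", "MEDIUM", "SMALL", "SMALL", "LARGE", "LARGE"],
    "distance": ["LONG", "SHORT", "MEDIUM", "MEDIUM", "MEDIUM", "SHORT"],
    "font": ["MEDIUM"] * 6,
}
instance = {
    "angle": ["LARGE", "MEDIUM", "NULL", "SMALL", "LARGE", "LARGE"],
    "distance": ["LONG", "SHORT", "NULL", "MEDIUM", "NULL", "SHORT"],
    "font": ["MEDIUM"] * 6,
}
terms = {"angle": ANGLE_TERMS, "distance": DISTANCE_TERMS, "font": FONT_TERMS}

# Weight of each corner: average over its three features.
corner_weights = []
for i in range(6):
    per_feature = [term_weight(instance[f][i], template[f][i], terms[f])
                   for f in ("angle", "distance", "font")]
    corner_weights.append(sum(per_feature) / len(per_feature))

# Overall weight of the group: average over its corners.
overall = sum(corner_weights) / len(corner_weights)
is_high = overall > 75.0  # "HIGH" threshold quoted in the text
```

With unrounded arithmetic the overall weight is about 80.56%; the 80.55% quoted in Table 4 follows from summing the corner weights after rounding them to one decimal place.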

5.3 Stamp Removal 

The output of the inference mechanism locates a group of “corners” which represents 
the detected stamp. Using the known position of the stamp’s corners, the maximum 
boundary of each text line of the stamp is easily located (each text line contains two 
corners, one at each end). The stamp can thus be removed by deleting all the pixels 
between each pair of corners from the input image, and (optionally) replacing them 
with background pixels (depending on what subsequent processing is required). 

Fig. 9. Example of an input image with a Type I stamp (a) and with the stamp removed (b). 



6 Evaluation and Results 

6.1 Evaluation 

Through investigation, we found that over 90% of archive cards from the Natural History 
Museum contain mainly two types of stamp patterns (Figures 2(a) and (d)). Therefore, 
we concentrated our evaluation on these two stamp types. Two sets of specimen 
index cards from the Natural History Museum were chosen for evaluation. The first set 
has 545 cards, 269 of which contain type I stamps (Fig. 2(a)), where the stamp is the 
same colour as most of the other text (e.g. text which is black) on the card image. The 
second set has 676 cards, 513 of which contain type II stamps (Fig. 2(d)), which are of a 
different colour (e.g. red) compared with the remaining card image text. 



Table 5. Result of evaluation. 

Stamp Type  No. of images  No. of stamps  Correctly detected  Wrongly detected  Completely removed 
Type I      545            292            269/292 (92.1%)     1/545 (0.2%)      250/269 (85.6%) 
Type II     676            513            487/513 (94.9%)     1/676 (0.1%)      487/513 (94.9%) 



6.2 Results 

The results shown in Table 5 are encouraging. 92.1% of Type I stamps were successfully 
detected and 85.6% of them were completely removed (Fig. 9(b)), while 94.9% of Type 
II stamps were successfully detected and all those detected were removed completely. 
The detection failures for Type I stamps were caused either by serious fading (where 
most of the stamp has been lost, Fig. 10(a)) or by the stamp touching other text 
(Fig. 10(b)). The detection failures for Type II stamps were caused only by serious colour 
fading; as a result, the stamp colour plane could not be separated properly in the first 
place, hence resulting in a damaged stamp. Some useful text is inevitably removed with 
the stamps in cases where the stamp touches other text of the same colour, as shown in 
Fig. 11. These situations are classified as incomplete removal, and result in the figures 
for stamp removal being lower than those for stamp detection for Type I stamps. Not 
surprisingly, the overall detection and removal rate for Type II stamps is higher than 
for Type I, because Type II stamps are usually separated from any adjacent text as part of 
the initial colour plane segmentation. 

Fig. 10. Example of (a) faded Type I stamp, (b) Type I stamp touching other text. 

Fig. 11. Example of Type I stamp (a) touching other text, (b) text has been removed with the 
stamp. 



7 Conclusion 

In this paper, the use of fuzzy logic to model structural features of text stamps has been 
proposed. With fuzzy modeled features, the stamp can be detected in spite of many 
variations or defects in the features extracted (e.g. partial damage, shape 
transformation). The application of a configurable interface for templates enhances the flexibility 
of the system and makes it adaptable for processing widely varying batches of archive 
images. The evaluation results are very encouraging, with a 92-95% correct stamp 
detection rate and an 85-95% complete removal rate being achieved. 

Abstract. Portable document format (PDF) is a common output format for elec- 
tronic documents. Most PDF documents are untagged and do not have basic 
high-level document logical structural information, which makes the reuse or 
modification of the documents difficult. We developed techniques that identi- 
fied logical components on a PDF document page. The outlines, style attributes 
and the contents of the logical components were extracted and expressed in an 
XML format. These techniques could facilitate the reuse and modification of 
the layout and the content of a PDF document page. 



1 Introduction 

Portable document format (PDF) is a common output format for electronic docu- 
ments. PDF documents preserve the look and feel of the original documents by de- 
scribing the low-level structural objects such as a group of characters, lines, curves 
and images and associated style attributes such as font, color, stroke, fill and shapes 
[1]. However, most PDF documents are untagged and do not have the basic high-level 
logical structure information such as words, text lines, paragraphs, logos and 
figure illustrations, which makes reusing, editing or modifying the layout or the 
content of the document difficult. Although the Portable Document Format 
(PDF) was originally designed for the final presentation of a document, there have been trends to 
extend the capability of PDF beyond a viewable format [2] and to recover the 
PDF document layout and its content. In variable data printing (VDP), it is desirable 
to reuse the layout of an existing PDF page as a template or to reedit the content of an 
existing page by replacing the images, figure illustrations or modifying a body of text 
to create a new page or a new version of the page without going back to the original 
document. The goal of this work is to identify logical components and their associ- 
ated layouts and to extract the content of the components in a PDF document. The 
layout and style attributes of the logical components can be used as a template for 
creating new pages. 

In the past two decades, extensive studies have been done on discovering the layout 
of document images [3-8] for content understanding and extraction. Because of the 
limited information available in bitmap images (for example, there is no layering 
information for the page objects), most of the layout studies have been limited to 
business letters, technical journals and newspapers, where object overlaps are minimal. 
Studies have also been done on multilayer analysis and segmentation of document 
images for compression purposes. Scanned document images were broken into 
different regions and different layers based on the texture of the objects on the page 
[13, 14]. Several groups have studied PDF document layout and logical structure 
[9-12]. Brailsford et al. have been able to segment a PDF document page into different 
image and text blocks. Anjewierden has reported his work on recovering the logical 
structure of technical manuals in PDF. Hadjar et al. have developed a tool for 
extracting the structures from PDF documents. However, most of the studies gave 
limited consideration to overlaid and arbitrarily shaped blocks. 

In this paper, we present techniques that automatically segmented a PDF document 
page into different logical structure regions such as text blocks, image blocks, 
vector graphics blocks and compound blocks. Each block represented a logical 
component of the document. We identified the geometrical outline of each 
component and extracted the style attributes and content of each component. The results of 
this analysis were expressed in an XML format. 

2 Document Layout and Content Analysis and Extraction 

The content stream of a PDF document lists all the page objects such as text objects, image 
objects, path objects, etc. [1]. Path objects are also referred to as vector graphics objects: 
drawings and paintings composed of lines, curves and rectangles. Page objects in a 
PDF document neither reflect nor are related to the logical structure or logical 
components of the document. For example, a text object may contain only some of the 
characters of a word; path objects, which are the building blocks for graphical 
illustrations such as bar charts, pie charts and logos, are often just a fraction of the whole 
figure illustration, e.g. one bar in a bar chart [16, 17]. To discover the logical 
components of a document, we need to analyze and interpret the layouts and attributes of the 
page objects so as to correctly break or group them into different logical components. 

2.1 Overview of the System 

Figure 1 shows our processing pipeline. In documents with high graphical content, 
blocks often overlay one another, e.g. a text block or a logo is embedded in an image, 
which makes the analysis more complicated. To minimize the potential overlay of logical 
components and the interference between different logical components, we 
first separated the page into three layers: text, image and vector graphics. Each 
layer became an individual PDF document. To identify the logical structural 
component blocks in each layer, we could either directly analyze the three PDF documents 
or convert the PDF documents into bitmap images and perform document image 
segmentation. We then obtained the outline of each component block as a polygon and 
extracted the style attributes and content of the component. For the text components, 
we extracted the font, size, line spacing and color as the text style and extracted the 
text strings as the content. For the image components, we identified the shapes of the 
images and the masks if they existed and extracted the data stream of each image into 
an image file. For the vector graphics components, we identified the path objects 
within each component, converted the path objects into SVG path objects and 
created an SVG file for each vector graphics component. 

Fig. 1. The overview of the processing pipeline. Each block represents a logical component or 
subcomponent of the document. 



2.2 Preprocessing 

Page objects can be simple types like text, image or path objects. They can also be 
compound types like container objects and form XObjects, which themselves may 
contain a list of objects including text, image and path objects or even another level 
of compound objects. In order to access all the objects on the page, objects within 
compound objects were extracted and replaced in the content stream list to substitute 
for the compound objects while the layout order of the objects was preserved. The 
extraction and substitution process continued until there were no more compound 
objects on the page display list. 
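
The substitution of compound objects can be pictured as an order-preserving flattening of a nested display list. The sketch below is only an illustration under assumed data structures: `PageObject` and `flatten` are hypothetical names, and no real PDF library API is used or implied.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageObject:
    """Hypothetical page object: kind is "text", "image", "path" or "compound"."""
    kind: str
    name: str = ""
    children: List["PageObject"] = field(default_factory=list)


def flatten(objects):
    """Replace every compound object by its contents, preserving layout order."""
    flat = []
    for obj in objects:
        if obj.kind == "compound":
            # Nested compound objects are expanded recursively in place.
            flat.extend(flatten(obj.children))
        else:
            flat.append(obj)
    return flat


display_list = [
    PageObject("text", "t1"),
    PageObject("compound", "form1", [
        PageObject("image", "i1"),
        PageObject("compound", "container1", [
            PageObject("path", "p1"),
            PageObject("text", "t2"),
        ]),
    ]),
    PageObject("path", "p2"),
]
simple_objects = flatten(display_list)
```

Because each compound object is expanded at its own position, the relative drawing order of the simple objects is unchanged.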



2.3 Separating the Document into Text, Image 
and Vector Graphics Documents 

After the compound objects were replaced by the simple-types objects, all the text 
objects on the page were extracted to form a new text-only PDF document, all the 
image objects were extracted to form a new image-only PDF document and all the 
path objects were extracted to form a new path-objects-only PDF document. The 
order of the objects placed on the page was preserved during this exercise. 

2.4 Text Segmentation 

For the text on the page, the PDF document provides information about all characters 
used in the text, their positions and attributes, such as font, color, character spacing, 
orientation, etc. Directly analyzing the PDF document can provide more accurate 
information than document image analysis and OCR, and the results are not affected 
by the contrast of the text, the background or text overlay. Therefore, our text 
segmentation was performed directly on the PDF document. 




Fig. 2. The flowchart for text segmentation. 



As shown in Figure 2, text segmentation followed a bottom-up approach, starting 
from the smallest text units: from characters in PDF documents, words were formed 
using Adobe’s word-finder [20, 21]; from words, text lines were formed based on the 
alignment and style of the words and the distance between them; and finally from text lines, 
text segments were formed based on text line adjacency, styles and line gaps. 

2.4.1 Forming Text Lines from the Words 

PDF documents have the style and position information about characters used in the
document. We used an attribute vector, which included the font name, font size, color
and orientation of the words, to represent the style of the words. Adobe’s Acrobat
word-finder provides the coordinates of the four corners of the quad(s) of each word.
A quad is the quadrilateral bounding a contiguous piece of a word. Based on the
orientation of the words, all the words on the page were separated into two groups,
horizontal and vertical words. Let α be the angle between the bottom line of the
quad and the horizontal axis of the page, with 0 < α < 90°: if 0 < α < 45°, the word
was treated as a horizontal word; otherwise, if 45° < α < 90°, it was treated as a
vertical word. Quads in each group were sorted based on the center positions of the
quads on the page.

Let (x, y) be the coordinates of the center of a quad and let PW and PH be the page
width and height. For horizontal words, the sorting was based on the values of
x + y * PW, and for vertical words the sorting was based on the values of
y + x * PH. Therefore, horizontal words were sorted from left to right first and then
from top to bottom, and vertical words were sorted from top to bottom first and then
from left to right. After sorting, the horizontal words aligned at the bottom were
listed next to each other. Similarly, vertical words aligned at the right were listed
next to each other.
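
A minimal sketch of the two sort keys follows. It assumes that y increases from the top of the page downward and that every center lies inside the page (0 <= x < PW, 0 <= y < PH), so the secondary coordinate never spills into the primary one; the function name and the sample page size are illustrative only.

```python
def reading_order_key(center, page_width, page_height, vertical=False):
    """Scalar sort key for a quad center (x, y).

    Assumption: y grows downward from the top of the page, so a larger
    key means a later position in reading order.
    """
    x, y = center
    if vertical:
        # y + x * PH: the column (x) dominates, y orders words within it.
        return y + x * page_height
    # x + y * PW: the row (y) dominates, x orders words within it.
    return x + y * page_width


centers = [(300, 100), (50, 100), (50, 20)]
horizontal_order = sorted(
    centers, key=lambda c: reading_order_key(c, 612, 792))
vertical_order = sorted(
    centers, key=lambda c: reading_order_key(c, 612, 792, vertical=True))
```

Because the key is a single number, words whose centers share the same y value end up next to each other in the sorted list, ordered by x.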

Text lines were started with the first words of the horizontal and vertical word
groups respectively. A text line adopted the style of the word that started it.
The next word on the list was then selected and compared against the text line.
The following criteria were checked: style, alignment and the distance between the
bounding box of the text line and the bounding box of the new word. The distance
between two aligned boxes was the shortest distance between the east and west edges
of the two boxes for bottom-aligned boxes, and between the north and south edges of
the two boxes for right-aligned boxes. If the word and the text line had the same
style, if they were aligned at the bottom for horizontal text or at the right for
vertical text, and if the distance between the bounding boxes of the word and the
text line was less than a predetermined limit (e.g. the width of a space character or
the maximum font width for the font used in the word and text line), then the word
was added to the text line. If the next word on the list did not satisfy the criteria,
another text line was started with this word. The text content of each text line was
obtained by concatenating the sorted words in the text line.
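
The grouping rule for horizontal text can be sketched as follows. The `Word` and `TextLine` classes, the alignment tolerance `align_tol` and the choice to join words with a single space are assumptions made for the example; boxes are `(x0, y0, x1, y1)` with `y0` the bottom edge.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    style: tuple   # (font name, font size, color, orientation)
    box: tuple     # (x0, y0, x1, y1), y0 is the bottom edge


@dataclass
class TextLine:
    style: tuple
    box: tuple
    words: list = field(default_factory=list)


def form_lines(sorted_words, max_gap, align_tol=0.5):
    """Greedily append each sorted word to the current line or start a new one."""
    lines = []
    for w in sorted_words:
        line = lines[-1] if lines else None
        if (line is not None
                and w.style == line.style                         # same style
                and abs(w.box[1] - line.box[1]) <= align_tol      # bottom-aligned
                and w.box[0] - line.box[2] < max_gap):            # east-west gap
            line.words.append(w)
            line.box = (min(line.box[0], w.box[0]), min(line.box[1], w.box[1]),
                        max(line.box[2], w.box[2]), max(line.box[3], w.box[3]))
        else:
            lines.append(TextLine(w.style, w.box, [w]))
    return lines


def line_text(line):
    # Assumption: words are joined with one space.
    return " ".join(w.text for w in line.words)


body = ("Helvetica", 10, "black", 0)
words = [
    Word("Hello", body, (0, 0, 20, 10)),
    Word("world", body, (23, 0, 45, 10)),
    Word("Far", body, (200, 0, 215, 10)),
    Word("Next", body, (0, -14, 18, -4)),
]
lines = form_lines(words, max_gap=5)
texts = [line_text(l) for l in lines]
```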

2.4.2 Text Line Analysis 

For text lines with the same style, there is usually a consistent line gap between
text lines that belong to the same text segment or paragraph. A larger gap is usually
present between two text segments. Text line analysis was performed to find the
consistent line gap for each text line style on the page. The values of the line gaps
were used later in forming the text segments.

After all the text lines were formed, the distances between any two adjacent text
lines with the same style and orientation were measured. Two text lines were
considered adjacent if no objects sat in the gap between them. The value of the text
line gap for a particular text style was the shortest vertical distance for horizontal
text and the shortest horizontal distance for vertical text. The distances were
measured from the bottom of the current text line to the top of the next line.
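
A simplified sketch of the line-gap measurement for horizontal text is given below. It assumes a y-up coordinate system and treats consecutive same-style lines (sorted top to bottom) as adjacent, without checking for objects sitting in the gap; both simplifications, and the function name, are assumptions of the example.

```python
from collections import defaultdict


def line_gaps(lines):
    """Return {style: smallest bottom-to-top gap between consecutive lines}.

    lines: list of (style, (x0, y0, x1, y1)) with y0 the bottom edge and
    the y axis pointing up.
    """
    by_style = defaultdict(list)
    for style, box in lines:
        by_style[style].append(box)

    gaps = {}
    for style, boxes in by_style.items():
        boxes.sort(key=lambda b: -b[3])            # top to bottom
        dists = [upper[1] - lower[3]               # bottom of current, top of next
                 for upper, lower in zip(boxes, boxes[1:])
                 if upper[1] - lower[3] >= 0]
        if dists:
            gaps[style] = min(dists)
    return gaps


sample = [
    ("body", (0, 90, 100, 100)),
    ("body", (0, 76, 100, 86)),
    ("body", (0, 50, 100, 60)),
    ("title", (0, 120, 100, 140)),
]
gaps = line_gaps(sample)
```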

2.4.3 Forming Text Segments from the Text Lines

Text segments were formed by grouping adjacent text lines that were close to each
other and had the same style and orientation. A text segment was started with a text
line. Using this text line as a reference, all text lines that were adjacent or
attached to it, had the same style, and lay within a threshold distance of it were
added to the text segment. The threshold was a function of the line gap and the font
size for the particular text line style. We chose the threshold to be the minimum of
twice the line gap and twice the height of the bounding box of the reference text
line. The text lines that were added to the text segment were marked as grouped.
Using the newly added text lines as references, with the same criteria, more adjacent
or attached lines were added to the text segment. This process was repeated until
there were no more ungrouped text lines that could be added to the current text
segment. Each text line could belong to only one text segment. The next segment
started with another ungrouped text line. The same process was performed to form all
the text segments on the page.
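
The growing procedure can be sketched as a flood fill over text lines. The adjacency test used here (horizontal overlap plus a vertical distance below the threshold) and the fallback when no line gap was measured for a style are assumptions of the example; only the threshold, min(2 * line gap, 2 * box height), comes from the text.

```python
def vertical_distance(a, b):
    """Gap along y between boxes (x0, y0, x1, y1); 0 if they overlap."""
    return max(0, max(a[1], b[1]) - min(a[3], b[3]))


def overlaps_horizontally(a, b):
    return a[0] < b[2] and b[0] < a[2]


def form_segments(lines, gap_by_style):
    """Group line indices into segments; each line joins exactly one segment.

    lines: list of (style, box); gap_by_style: {style: line gap}.
    """
    grouped = [False] * len(lines)
    segments = []
    for start in range(len(lines)):
        if grouped[start]:
            continue
        grouped[start] = True
        segment, frontier = [start], [start]
        while frontier:
            ref_style, ref_box = lines[frontier.pop()]
            height = ref_box[3] - ref_box[1]
            # Assumption: fall back to the box height if no gap is known.
            gap = gap_by_style.get(ref_style, height)
            threshold = min(2 * gap, 2 * height)
            for j, (style, box) in enumerate(lines):
                if (not grouped[j] and style == ref_style
                        and overlaps_horizontally(ref_box, box)
                        and vertical_distance(ref_box, box) < threshold):
                    grouped[j] = True
                    segment.append(j)
                    frontier.append(j)      # newly added line becomes a reference
        segments.append(sorted(segment))
    return segments


page_lines = [
    ("body", (0, 90, 100, 100)),
    ("body", (0, 76, 100, 86)),
    ("body", (0, 50, 100, 60)),
    ("body", (0, 36, 100, 46)),
]
segments = form_segments(page_lines, {"body": 4})
```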

2.4.4 Combining the Text Segments 

Sometimes one or more words within a paragraph were highlighted with a different
style, such as italic, while the majority of the words were regular; superscripts
or subscripts might also be present, with a different font size and alignment
from the main text. The previous text segmentation steps would identify
them as individual text segments. Since our goal was to find the design intention for
the logical blocks, which should be minimally affected by the current content, in this
step we combined a small text segment with a big one if the small text segment
was embedded in the big one. The attributes or style of the combined segment
were those of the big text segment, which represented the style of
the majority of the text within the block.

2.4.5 Tracing the Outlines of the Text Segments 

In graphically rich documents, we often saw a text block wrapping around another
arbitrarily shaped image or graphics block. It was important to find the outline of the
text block so that the shape and the layout relation between blocks would be
preserved during text content replacement or modification. As shown in Figure 3, the
outline of a text segment was identified by tracing the corner points of the bounding
boxes of the sorted text lines within the segment. Text lines within a horizontal text
segment were sorted from left to right first and then from top to bottom.
Text lines within a vertical text segment were sorted from top to bottom first
and then from left to right. To find the outline of a horizontal text segment, as
shown in Figure 3a, we first connected all the left corner points of the bounding boxes
of the text lines within the segment down to the last text line; then moved to the
bottom right corner of the bounding box of the last text line; and connected all the
right corner points. When connecting points on the left side, only a move down was
allowed, and when connecting points on the right side, only a move up was allowed.

Fig. 3. Tracing the outline of a text segment by connecting the corner points of the bounding
boxes of text lines. (a) For horizontal text lines, connections started with the text line on the
top, moved down to connect all the left side points, then moved up to connect all the right side
points of the bounding boxes of the text lines. (b1) When connecting the left side points, only
moving down was allowed. (b2) When connecting the right side points, only moving up was
allowed. (c) For vertical text lines, connections started with the leftmost text line, moved
right to connect all the points on the top and then moved left to connect all the points at the
bottom of the bounding boxes of the text lines. (d1) When connecting the points on the top, only
moving right was allowed. (d2) When connecting the points at the bottom, only moving left was
allowed.

Processing steps were as follows:



Assuming the origin is at the lower left corner:

1. Sort all boxes by the vertical positions of their centers from top to bottom: B_0, B_1, ..., B_{n-1}.

2. Create two point-vectors L and R for the left and right points.

3. Fill L with the left top and bottom points of the bounding boxes of the text lines, starting with
   the bounding box on the top: [L_0, L_1, ..., L_{2n-1}] = [B_0.L.top, B_0.L.bottom, B_1.L.top,
   B_1.L.bottom, ..., B_{n-1}.L.top, B_{n-1}.L.bottom];

4. Fill R with the right bottom and top points of the bounding boxes of the text lines, starting with
   the bounding box at the bottom: [R_0, R_1, ..., R_{2n-1}] = [B_{n-1}.R.bottom, B_{n-1}.R.top,
   B_{n-2}.R.bottom, B_{n-2}.R.top, ..., B_0.R.bottom, B_0.R.top];

5. Set NumPtsL = 2*n; NumPtsR = 2*n; count = 0;

6. do { count = 0;
     for i = 0 : NumPtsL-1
       if connecting L_i to L_{i+1} is a right up move (i.e. L_i.x < L_{i+1}.x and L_i.y < L_{i+1}.y),
         remove L_{i+1} from L[]; count++; NumPtsL--; (Figure 3(b1))
       else if connecting L_i to L_{i+1} is a left up move (i.e. L_i.x > L_{i+1}.x and L_i.y < L_{i+1}.y),
         remove L_i from L[]; count++; NumPtsL--; (Figure 3(b2))
   } while (count > 0)

7. do { count = 0;
     for i = 0 : NumPtsR-1
       if connecting R_i to R_{i+1} is a right down move (i.e. R_i.x < R_{i+1}.x and R_i.y > R_{i+1}.y),
         remove R_i from R[]; count++; NumPtsR--; (Figure 3(b2))
       else if connecting R_i to R_{i+1} is a left down move (i.e. R_i.x > R_{i+1}.x and R_i.y > R_{i+1}.y),
         remove R_{i+1} from R[]; count++; NumPtsR--; (Figure 3(b1))
   } while (count > 0)

8. Connect all the points in the order L_0 -> L_1 -> ... -> L_{NumPtsL-1} -> R_0 -> R_1 -> ... ->
   R_{NumPtsR-1} -> L_0,

where NumPtsL is the number of points in the set L and NumPtsR is the number of points in
the set R.

Similarly, to find the outline of a vertical segment, we first connected
all the top corner points of the bounding boxes of the sorted text lines, then
connected all the bottom points of the bounding boxes of the sorted text lines, as
shown in Figure 3c. When connecting points on the top, only a move right was allowed;
when connecting points at the bottom, only a move left was allowed, as shown in
Figure 3d. The outlines of the segments were expressed as polygons whose vertices
were the connecting points.
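
The tracing steps for a horizontal segment can be sketched as below. This is one reading of the listed steps rather than the authors' implementation: boxes are `(x0, y0, x1, y1)` with the origin at the lower left, up moves are pruned from the left chain and down moves from the right chain, and the function names are illustrative.

```python
def prune_left(points):
    """Left chain (top to bottom): remove points until no step goes upward."""
    pts = list(points)
    removed = True
    while removed:
        removed = False
        i = 0
        while i < len(pts) - 1:
            (x0, y0), (x1, y1) = pts[i], pts[i + 1]
            if x0 < x1 and y0 < y1:        # right-up move: drop the later point
                del pts[i + 1]
                removed = True
            elif x0 > x1 and y0 < y1:      # left-up move: drop the earlier point
                del pts[i]
                removed = True
            else:
                i += 1
    return pts


def prune_right(points):
    """Right chain (bottom to top): remove points until no step goes downward."""
    pts = list(points)
    removed = True
    while removed:
        removed = False
        i = 0
        while i < len(pts) - 1:
            (x0, y0), (x1, y1) = pts[i], pts[i + 1]
            if x0 < x1 and y0 > y1:        # right-down move: drop the earlier point
                del pts[i]
                removed = True
            elif x0 > x1 and y0 > y1:      # left-down move: drop the later point
                del pts[i + 1]
                removed = True
            else:
                i += 1
    return pts


def trace_outline(boxes):
    """Return the polygon vertices outlining a horizontal text segment."""
    ordered = sorted(boxes, key=lambda b: -(b[1] + b[3]) / 2)   # top to bottom
    left, right = [], []
    for x0, y0, x1, y1 in ordered:
        left += [(x0, y1), (x0, y0)]            # left top, left bottom
    for x0, y0, x1, y1 in reversed(ordered):
        right += [(x1, y0), (x1, y1)]           # right bottom, right top
    return prune_left(left) + prune_right(right)


separate = trace_outline([(0, 10, 10, 20), (0, 0, 6, 8)])
overlapping = trace_outline([(0, 10, 10, 20), (5, 0, 12, 12)])
```

The polygon is closed implicitly by joining the last right-chain point back to the first left-chain point.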

2.4.6 Extracting the Content of the Text Segments 

After all the text segments were identified, text contents within the text segments 
were formed by concatenating the text strings of the sorted text lines within the seg- 
ments. 

2.5 Image Components Identification and Extraction 

To identify the image components on the page, we could convert the image-only PDF
document to a bitmap image and apply document image segmentation. However,
since image objects in PDF are well-defined logical units (unlike text and
path objects), image components were identified directly from the PDF document.

Each image object in the PDF document page content stream was identified as a 
logical component of the document, the image component. The image data stream 
was extracted and saved as an image file. If the image had a mask that gave the spe- 
cial effect to the image or a clipping path that defined the visible portion of the image, 
the mask and clipping path were extracted and the image object was saved as an SVG 
image object and an SVG file would be created. An XML file, which described the 
bounding box of the image component on the page, the width and height of the image 
and the type of the image decoder, was created. It also referenced the image file name 
or the SVG file name. 

2.6 Vector Graphics Segmentation and Extraction 

To find the vector graphics component blocks on the page, we had the choice of
analyzing the PDF directly [16] or applying a document image segmentation technique in
automatic mode [18, 19]. We found that in many cases, for vector graphics components,
it was more straightforward to analyze the bitmap of the page than to follow the
drawings of the path objects on the page. Therefore we applied a document image
segmentation tool to find the regions that represented different components. The outline
of each region was used to find the path objects within each component on the PDF
page. With those path objects, we created the SVG representation for the logical
component.

2.6.1 Vector Graphics Zoning 

We converted the path-objects-only PDF page into a bitmap image and applied a 
document image segmentation tool to the bitmap image. Different zones were identi- 
fied. Each zone represented a logical component. The outline of each zone was ex- 
pressed in a polygon. 

2.6.2 Identifying the Path Objects Within Each Block 

In a PDF document, drawings or path objects are expressed as a combination of lines,
curves and rectangles. In this step, the coordinates of the polygon outline of each
region in the bitmap image coordinate space were converted to coordinates in the
PDF page coordinate space, the path objects enclosed in the polygons were identified
and the corresponding SVG objects were created.

To find out whether a path object in the PDF document belonged to a region, we could
check whether all the points along the lines, curves and rectangles of the path object
were within the polygon outline of the region. To reduce the computational complexity,
we only checked the starting and ending points of the lines and curves and the four
corner points of the rectangles. If and only if the starting and ending points of the
lines and curves and the four corner points of the rectangles of all the sub-segments
of a path object were inside the polygon [22], the path object was identified as part
of the logical component. All the identified path objects were translated into SVG
path objects. An SVG file was created for each logical component.
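
The reduced membership test can be sketched with a standard ray-casting point-in-polygon check. The tuple encoding of sub-segments (`("line", start, end)`, `("curve", start, end)`, `("rect", (x0, y0, x1, y1))`) is a hypothetical representation chosen for the example, and ray casting is a common choice that is not necessarily the test of [22].

```python
def point_in_polygon(pt, polygon):
    """Ray casting: count crossings of a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        if (y0 > y) != (y1 > y):                       # edge straddles the ray
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def check_points(sub):
    """Points tested for one sub-segment: endpoints, or four rectangle corners."""
    kind = sub[0]
    if kind in ("line", "curve"):
        return [sub[1], sub[2]]
    if kind == "rect":
        x0, y0, x1, y1 = sub[1]
        return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    raise ValueError(f"unknown sub-segment type: {kind}")


def path_in_region(path, polygon):
    """A path belongs to the region only if every tested point is inside."""
    return all(point_in_polygon(p, polygon)
               for sub in path for p in check_points(sub))


region = [(0, 0), (10, 0), (10, 10), (0, 10)]
inside_path = [("line", (1, 1), (5, 5)), ("rect", (2, 2, 4, 4))]
crossing_path = [("line", (1, 1), (15, 5))]
```

Checking only endpoints is cheaper than sampling along each curve, at the cost of accepting a curve that bulges outside the polygon between two interior endpoints.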

If there were still unidentified path objects left that did not belong to any existing
logical component, they were grouped to form new logical
components based on their attachment, vicinity and layering order [17]. Again, an
SVG file was created for each logical component.

2.7 Combining Components into a Compound Component 

If a component block discovered in Sections 2.4-2.6 was completely embedded in another
block, the two components were most likely one logical component. Therefore,
in this step, they were combined into a compound component with two subcomponents.
This process was repeated until no component was completely
embedded in another component. The subcomponents remained accessible logical
units that could be modified or replaced.
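
The nesting step can be illustrated with bounding boxes. This is a simplification: the blocks described above have polygon outlines, whereas the sketch below tests containment of axis-aligned boxes `(x0, y0, x1, y1)` and attaches each block to the smallest block that contains it; the function names are illustrative.

```python
def contains(outer, inner):
    """True if box inner lies completely within box outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def parents(boxes):
    """For each box, the index of the smallest strictly larger box containing it."""
    result = []
    for i, box in enumerate(boxes):
        candidates = [j for j, other in enumerate(boxes)
                      if j != i and contains(other, box) and area(other) > area(box)]
        result.append(min(candidates, key=lambda j: area(boxes[j]))
                      if candidates else None)
    return result


blocks = [(0, 0, 100, 100), (10, 10, 50, 50), (20, 20, 30, 30), (200, 0, 300, 50)]
parent_index = parents(blocks)
```

A block whose entry is `None` is a top-level component; the others remain accessible as subcomponents of their parent.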

3 Results 

Figure 4 shows the results of applying our tool to a PDF page at different processing
steps. The outlines of different segments were highlighted, with the text segments in
green, the image segments in red and the vector graphics segments in blue. The text
contents were expressed in XML files. Images were saved as image or SVG files and
vector graphics components were saved as SVG files.

Fig. 4. An example that shows how the system works by giving the results at each processing
step. The page was separated into text, image and vector graphics layers; each layer was
analyzed to identify logical-component blocks. The style and content of the components were
extracted.



We tested our tool on 100 pages of HP marketing collateral and 100 pages
of travel brochures. Segmentation results were checked manually. Segmentation
errors occurred on 18 of these 200 pages. The errors fell into three categories:
(1) pages with tables, an example of which is shown in Figure 5a, where the text in
different cells was grouped into the same segment; (2) text on a map, an example of
which is shown in Figure 5b, where pieces of text that were close to each other but
indicated different locations on the map were grouped into the same segment; (3) text
style changing after a colon, an example of which is shown in Figure 5c. To overcome
these errors, methods to identify special kinds of logical components, such as table
and map recognition, need to be developed. User instructions would also help to
improve the precision of the results.

Fig. 5. Examples of segmentation errors. Segmentation errors occurred when the text
was embedded in a table (a), when text was embedded in a map (b), and when the text style
changed after a colon (c).





4 Conclusion and Discussion 

We have developed techniques to identify logical components on a PDF page. The
outlines of the segments were determined, and the style and content of the
components were extracted and expressed in an XML file. A user could treat the outline
and style attributes of the logical components on the page as a layout template for
creating a new page. A user could also take a logical component and modify it or
reuse it in another document.

However, page layout is more than the outline of the document; it also includes the
relational constraints between the logical components, such as alignments and white
space. Further research needs to be done to discover the layout constraints so that
more dynamic layout reuse can be supported.

Abstract. In this paper we propose a new method to extract filled-in items
separately from Japanese bank-checks based on prior knowledge
about their color characteristics and layout. We have analyzed the
bank-check characteristics and proposed a model that can be used to
extract the filled-in items and is applicable to any Japanese bank-check. The
areas where the filled-in items are supposed to appear are first extracted
through a template. Then the filled-in characters and seal imprints are
extracted on the basis of their color characteristics in HSV color space.
The results of testing experiments show that this extraction method is
capable of extracting the filled-in items from most Japanese bank-checks.



1 Introduction 

Automatic bank-check reading and verification is an active topic in the field of
document analysis. Many studies on bank-check processing have been carried
out to date [1], [2]. In order to develop a bank-check processing
system, several modules must be integrated, such as those for extraction of filled-in
items, for recognition of handwritten and printed characters, and for verification of
the signature or seal imprint. In particular, the extraction of filled-in items is an
important preprocessing step that facilitates subsequent high-accuracy recognition
and verification. Therefore there are many reports on the extraction of filled-in
items from bank-checks [1], [3], [4].

However, these methods cannot be applied directly to Japanese bank-checks
due to their specific properties. In the case of Japanese bank-checks, a signature
(generally a stamp of the payer’s name) is written or stamped on a bank-check, which
may have any of various background patterns. Then a seal is stamped over the
signature. Fig. 1 shows an example of a Japanese bank-check. The seal imprint
is used as a more important clue than the signature for confirming the validity of
bank-checks. Therefore it is important to extract the signature and the
seal imprint separately from bank-checks.

In this paper we propose a new method to accurately extract the filled-in
items from Japanese bank-checks based on the above background. This method
extracts the filled-in items based on their color characteristics in HSV color
space. We also show that the proposed method can extract the filled-in
items with good quality for subsequent recognition and verification processes.

Fig. 1. An example of Japanese bank-checks.



2 Proposed Method 

The flow diagram of the proposed method is shown in Fig. 2. Since the size and
layout of all items of Japanese bank-checks have been standardized as shown in
Fig. 1, we first extract the areas to be processed through a template. Then the
extracted areas are converted into HSV color space. Next, the filled-in
characters (amount, date and signature) and the seal imprint are extracted separately
based on their color characteristics in HSV color space. Finally, isolated noise
and preprinted characters are removed from the extracted item images.




Fig. 2. Flow diagram of the proposed extraction method. 



Japanese bank-checks have a common feature with respect to color: the background
patterns are various light colors, the signatures are black or blue,
and the seal imprints are vermilion or red. Therefore it is reasonable to
assume that the pixels of the color check image form four clusters corresponding
to the background B, the characters C, the seal imprint I, and the superimposed
area of signature and seal imprint IC. In order to obtain the color characteristics of
Japanese bank-checks, we calculated a histogram of the pixels of real bank-checks in
S-V color space. Fig. 3(a) shows an example of the histogram of a check image.

Fig. 3. Histogram and color model of check image in S-V space.



After analyzing many checks, we confirmed that most
check images had similar histograms and formed four clusters as shown in
Fig. 3(a). Therefore we define the color model of check images in S-V space as
shown in Fig. 3(b).

Character Extraction. The intensity value of characters, including the superimposed
area of signature and seal imprint, is lower than those of
other areas, as shown in Fig. 3. Therefore we extract pixels whose intensity
value V is less than or equal to the threshold T_c (solid line in Fig. 3(b)) as
character pixels. Let V_max be the peak position of the histogram of area B on the
intensity value axis. Threshold T_c is defined by the following equation as an
experimental optimum.

\[
T_c = \begin{cases}
V_{max} - 50 & : S < 100 \\
V_{max} - 50 - k\,(S - 100) & : S \geq 100
\end{cases}
\qquad (1)
\]

Here k denotes the slope coefficient of the second branch; its printed value is not
legible in this copy.

After this processing, preprinted characters are removed from the extracted image
based on their position, size and line width.
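
The character threshold can be sketched as a small function. S and V are assumed to be on a 0-255 scale, and the slope of the second branch is passed in as the parameter `slope` because its value cannot be read from the equation as reproduced here; the value 0.5 in the usage lines is purely illustrative.

```python
def character_threshold(v_max, s, slope):
    """Intensity threshold T_c for a pixel with saturation s.

    v_max: peak position of the background histogram on the V axis.
    slope: coefficient of (s - 100) in the second branch (assumed parameter).
    """
    if s < 100:
        return v_max - 50
    return v_max - 50 - slope * (s - 100)


def is_character_pixel(v, s, v_max, slope):
    """A pixel is a character pixel if its intensity is at or below T_c."""
    return v <= character_threshold(v_max, s, slope)


t_low_sat = character_threshold(230, 50, 0.5)
t_high_sat = character_threshold(230, 140, 0.5)
dark_pixel = is_character_pixel(100, 50, 230, 0.5)
light_pixel = is_character_pixel(200, 50, 230, 0.5)
```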

Seal Imprint Extraction. Pixels of the seal imprint and of blue characters are
located in an area of higher saturation. Therefore we first remove blue pixels
using the H (hue) value in order to extract only the seal imprint. Then we extract
pixels whose saturation value S is greater than or equal to the threshold T_I (broken
line in Fig. 3(b)) as seal imprint pixels. The superimposed area of signature and seal
imprint can also be extracted as seal imprint, because S of the superimposed
area is higher than those of the background and characters. Let S_max
be the peak position of the histogram of area B on the saturation value axis.
Threshold T_I is defined by the following equation as an experimental optimum.

\[
T_I = \begin{cases}
T_{I1} & : T_{I1} > T_{I2} \\
T_{I2} & : T_{I1} \leq T_{I2}
\end{cases}
\qquad (2)
\]

where T_{I1} = S_max + 85 and T_{I2} = 200 - V.
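
The seal imprint rule amounts to taking the larger of the two candidate thresholds after discarding blue pixels. In the sketch below S and V are assumed to be on a 0-255 scale and hue in degrees; the blue hue interval `blue_hue` is a hypothetical choice, since the text does not give the hue range used.

```python
def seal_threshold(s_max, v):
    """Saturation threshold: the larger of S_max + 85 and 200 - V."""
    return max(s_max + 85, 200 - v)


def is_seal_pixel(h, s, v, s_max, blue_hue=(150, 270)):
    """Classify one HSV pixel as seal imprint.

    blue_hue: assumed hue interval (degrees) treated as blue ink and removed.
    """
    if blue_hue[0] <= h <= blue_hue[1]:
        return False                      # blue character pixel, not seal
    return s >= seal_threshold(s_max, v)


t_bright = seal_threshold(20, 150)        # 200 - V is small for bright pixels
t_dark = seal_threshold(20, 60)           # 200 - V dominates for dark pixels
red_pixel = is_seal_pixel(5, 180, 150, 20)
blue_pixel = is_seal_pixel(220, 180, 150, 20)
```

For dark pixels the 200 - V term raises the threshold, which keeps low-intensity character strokes from being picked up as seal imprint.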

3 Experimental Results

We performed the extraction experiment with 19 Japanese bank-checks filled in
with various colors of ink that are expected to appear in practical situations.
Although the number of checks used for the experiment is small, they represent the
properties of most checks used in practical situations. We inspected the
extraction results visually. As a result, it was confirmed that all samples in character
extraction and 18 samples in seal imprint extraction were extracted precisely.
Fig. 4 shows a sample result of filled-in items extracted from the check image
shown in Fig. 1. All items were extracted completely. In particular, it is important
that the signature and the seal imprint superimposed over each other could
be extracted precisely by this method, as shown in Fig. 4(c) and (d).

Fig. 4. Example of filled-in items extracted from the check shown in Fig. 1.



4 Conclusion 

In this paper, we proposed an automatic extraction method for filled-in items from
Japanese bank-check images. The results of testing experiments showed that this
method is applicable to most Japanese bank-checks. We will evaluate this method
more precisely with a large number of samples in the future.

Abstract. A color decorrelation strategy to improve the human or automatic
readability of degraded documents is presented. The particular
degradation that is considered here is bleed-through, that is, a pattern 
that interferes with the text to be read due to seeping of ink from the re- 
verse side of the document. A simplified linear model for this degradation 
is introduced to permit the application of very fast decorrelation tech- 
niques to the RGB components of the color data images, and to compare 
this strategy to the independent component analysis approach. Some ex- 
amples from an extensive experimentation with real ancient documents 
are described, and the possibility to further improve the restoration per- 
formance by using hyperspectral/multispectral data is envisaged. 



1 Introduction 

Improving the readability of printed or manuscript documents is a common
need in libraries and archives. The original documents should not be altered,
but the availability of more readable digital versions is an important aid to the
scholar. Furthermore, one of the main tasks in document analysis is to produce
machine-readable versions of original texts. This task should be performed
automatically or, at least, with minimal human intervention. This is normally done
by optical character recognition systems, whose performance, however, depends
on the quality of the original documents. Ancient documents, in particular, are
often affected by several types of degradation. Restoring the digital scans of the
original documents is thus essential to improve human readability and to regain
acceptable OCR performance.

The particular kind of degradation we are considering here is bleed-through,
that is, the presence of patterns interfering with the main text due to seeping of
ink from the reverse side of the page. Removing the bleed-through pattern from a
digital image of a document is not trivial, especially with ancient originals, where
interferences of this kind are usually very strong. Indeed, dealing with strong
bleed-through degradation is practically impossible with any simple thresholding
technique. Some work done on this specific problem has exploited information
from the front and back pages [11] [4] [2]. The drawbacks of this type of
technique are that the scans from both sides of the documents must be available,
and a preliminary registration of the two sides is needed. In addition, they are
usually expensive, as the processing may be quite complicated. In [9], a color
scan from a single side is required, but a thresholding technique can only be used
in the framework of multiresolution analysis and adaptive binarization.

Our approach to this problem is to model a document image as a linear
combination of three independent patterns: the main foreground text, the
bleed-through, and the background pattern, i.e. an image of the paper or any other
support, which can contain various interfering features, such as stains, color
inhomogeneities, textures, etc. Our aim is to obtain a clean main text by
reducing the interferences from bleed-through and background. This goal can be
achieved by processing multiple “views” of the mixed object. When a color scan
of the document is available, three different views can be obtained from the red,
green, and blue image channels, but scans at nonvisible wavelengths can also be
available. By processing the different color components, it is possible to extract
the main text pattern and, sometimes, even to achieve a complete separation of
the overlapped patterns. Although our linear image model is known to be naive
[11], it has already proved to give interesting results for extracting the hidden
texts from color images of palimpsests, with the mixture coefficients evaluated
by visual inspection [5]. Nevertheless, in general, the mixture coefficients
are not known, and the separation problem becomes one of blind source
separation (BSS). An effective solution to BSS can be found if the source patterns
are mutually independent, by using separation techniques based on independent
component analysis, or ICA [7]. We already proposed ICA for document
processing [13], obtaining good results with real manuscripts.

Instead of enforcing independence, in this paper we only try to decorrelate the 
observed data. As is known, this requirement is weaker than independence, and 
can be satisfied by transforming the data in order to get zero cross-covariances. 
Conversely, independence implies that the cross-central-moments of all orders 
between each pair of estimated sources must be zero. In principle, no source 
separation can be obtained by only constraining second-order statistics, at least 
if no additional requirements are satisfied [3]. However, our present aim is not full 
separation but interference reduction, and this can often be achieved even by only 
constraining second-order statistics. Furthermore, the second-order approach is 
always less expensive than most ICA algorithms and, in many cases, it is even 
more effective for our purposes. Enforcing statistical uncorrelation is equivalent
to orthogonalizing the different data images. The result of orthogonalization is
of course not unique. We experimentally tested the performances of different 
strategies, and compared the results to the ones obtained by the ICA approach. 

The paper is organized as follows. In Section 2, we introduce our linear data
model. In Section 3, we recall the properties of different orthogonalization
matrices and, in Section 4, we present some experimental results with real printed
or manuscript documents. Some final remarks highlight the promise of applying
this type of technique to color document processing.

2 Data Model

Let us assume that we have a collection of T samples from a random N-vector x,
which is generated by linearly mixing the components of a random M-vector s
through an N × M mixing matrix A:

\[
x(t) = A\,s(t), \qquad t = 1, 2, \ldots, T \qquad (1)
\]

Our present aim is to describe a color document image by means of a model
of the type (1). Let us assume that each pixel (of index t) of a color scan of a
document has a vector value x(t). Normally, the color vector has dimension 3 (it
is composed, for example, of the red, green, and blue channels), thus we have
N = 3. In our case, let us also assume that the image can be modeled as the
superposition of three different sources, or classes, that we will call “background”,
“main text” and “bleed-through”. Thus, we also have M = 3. In general, by
using hyperspectral sensors, the “color” vector can assume a dimension greater
than 3. Likewise, we can also have M > 3 if additional patterns are present in
the original document. In this paper, we only consider the case M = N = 3,
although in principle there is no difference with the general case. Since we consider
images of documents containing text, we can also reasonably assume that the
color of each source is almost uniform, i.e., we will have mean reflectance indices
(r_1, g_1, b_1) for the background, (r_2, g_2, b_2) for the main text and
(r_3, g_3, b_3) for the bleed-through. In this application, we assume that the three
sources mix linearly at each pixel, and that noise and blur can be neglected. With
this approximate model, the reflectance indices (x_r(t), x_g(t), x_b(t)) of a generic
point t of the document are given by:



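$$\begin{bmatrix} x_r(t) \\ x_g(t) \\ x_b(t) \end{bmatrix} = \begin{bmatrix} r_1 & r_2 & r_3 \\ g_1 & g_2 & g_3 \\ b_1 & b_2 & b_3 \end{bmatrix} \begin{bmatrix} s_1(t) \\ s_2(t) \\ s_3(t) \end{bmatrix} \tag{2}$$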



where the functions $s_i(t)$, i = 1, 2, 3, denote the “quantity” of background, main text 
and bleed-through, respectively, that concur to form the color at point t. For 
instance, a pure background point t will be represented as s(t) = (1, 0, 0). Eq. 
(2) has the same form as eq. (1), restricted to the 3 × 3 case, where the parameters 
$r_i$, $g_i$, and $b_i$ are the coefficients of the mixing matrix A, and the functions $s_i(t)$ are 
the sources. 
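
As a minimal sketch of model (2), the following NumPy fragment mixes three source maps into RGB values. The numerical entries of A and the helper name `mix` are illustrative assumptions, not values measured on our documents.

```python
import numpy as np

# Hypothetical mean reflectances; columns are background, main text, bleed-through.
# These numbers are illustrative only and are not taken from the experiments.
A = np.array([
    [0.90, 0.20, 0.60],   # r_1, r_2, r_3
    [0.85, 0.15, 0.45],   # g_1, g_2, g_3
    [0.70, 0.10, 0.35],   # b_1, b_2, b_3
])

def mix(A, s):
    """Apply x(t) = A s(t) to every pixel; s has shape (3, T), one column per pixel."""
    return A @ s

# Three pixels: pure background, pure main text,
# and an equal blend of background and bleed-through.
s = np.array([
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5],
])
x = mix(A, s)   # shape (3, 3): one RGB column per pixel
```

A pure background pixel, s(t) = (1, 0, 0), simply reproduces the first column of A, which is the statement made after eq. (2).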

However, this model does not perfectly account for the phenomenon of 
bleed-through. Just to mention one aspect, in the pixels where the bleed-through is 
superimposed on the main text, the resulting color is not the vector sum of the 
colors of the two components, but is likely to be some nonlinear combination 
of them. In [11], a nonlinear model is also derived for the phenomenon of show- 
through (interfering pattern from the reverse side due to transparency of the 
paper). Although the linear model is only a rough approximation, it has already 
been useful in several applications [5] [13]. As already mentioned, if both s and A 
in (1) are unknown, estimating them from x(t) alone is called a problem of blind 
source separation. For it to be uniquely solvable, some additional information is 
needed. For example, if the components of vector s are statistically independent 
of each other, it can be solved by ICA techniques [7]. Good results can 
also be obtained even if the independence assumption is not completely verified. 
We already addressed this issue in [13]. In our work, we have found that, in 
many cases where ICA was not able to separate the individual sources, in some 
of our outputs the interfering patterns were greatly reduced. The failure of ICA 
can be ascribed to both the approximated model and the significant correlation 
between different sources. We found that, to reduce the presence of the interfering 
patterns, it suffices to decorrelate the signals, without requiring their mutual 
independence. Decorrelation is faster than ICA, but it is not unique, and it is 
interesting to assess experimentally the performances of different decorrelation 
strategies in terms of interference reduction. 

3 Processing Strategy 

Since we have no physical grounds to justify the ICA assumptions, the ICA 
output processes will not be guaranteed to be replicas of the original sources. 
However, one can try to maximize the information content in each component of 
the data vector by decorrelating the observed image channels. This amounts to 
forcing the cross-covariances of the data to zero. To avoid cumbersome notation, 
and without loss of generality, let us assume that the data vectors have zero mean. To 
obtain zero cross-covariances, we seek a linear transformation 

$$y(t) = W x(t) \tag{3}$$

such that: 

$$\langle y_i y_j \rangle = 0, \qquad \forall\, i, j = 1, \ldots, M, \quad i \neq j \tag{4}$$

where the notation ⟨·⟩ means expectation, and W is generally an M × N 
matrix. In other words, the components of the transformed data vector y are 
orthogonal. It is clear that this operation is not unique, since, given any or- 
thonormal basis, any rigid rotation yields another orthonormal basis spanning 
the same subspace. 

What we want to stress here is that linear data processing can help to restore 
color text images, although the linear model is not fully justified. This strategy 
resembles the standard color space transformations in image processing, which 
are able to enhance specific perceptual information in color images [10] [14] [1] 
[8]. In [10], the authors compare many fixed color transformations on the basis of 
their effects on the performance of a particular recursive segmentation algorithm. 
They argue that, among linear transformations, the ones that obtain maximum- 
variance components are the most effective. They thus derive a criterion for 
the evaluation of fixed transformations in comparison with the Karhunen-Loeve 
transformation, which is known to give orthogonal output vectors. This approach 
is also called principal component analysis (PCA), and one of its purposes is to 
find the most useful (the ones that have dominant variances) among a number 
of variables [3]. Our data covariance matrix is the N × N matrix: 

$$R_{xx} = \langle x x^T \rangle \tag{5}$$

where superscript T means transposition. Since we do not have the probability 
density function of vector x, we are not able to compute the expectations needed. 
An approximate estimate of the covariance matrix can be drawn from the 
sample at our disposal, that is, from the RGB components of our image: 

$$R_{xx} \approx \frac{1}{T} \sum_{t=1}^{T} x(t)\, x^T(t) \tag{6}$$

Since the data are normally correlated, matrix $R_{xx}$ will be nondiagonal. The 
covariance matrix of vector y defined in (3) is 

$$R_{yy} = \langle W x x^T W^T \rangle = W R_{xx} W^T \tag{7}$$

To obtain property (4) for y, we should require that matrix $R_{yy}$ be diagonal: 

$$R_{yy} = W R_{xx} W^T = D_M \tag{8}$$

where $D_M$ is any diagonal matrix of order M. Let us now perform the eigenvalue 
decomposition of matrix $R_{xx}$: 

$$R_{xx} = V_x \Lambda_x V_x^T \tag{9}$$

where $V_x$ is the matrix of the eigenvectors of $R_{xx}$, and $\Lambda_x$ is the diagonal matrix 
of its eigenvalues, in decreasing order. Now, it is easy to verify that all of the 
following choices for W yield a diagonal $R_{yy}$: 

$$W_o = V_x^T \tag{10}$$

$$W_w = \Lambda_x^{-1/2} V_x^T \tag{11}$$

$$W_s = V_x \Lambda_x^{-1/2} V_x^T \tag{12}$$

Matrix $\Lambda_x^{-1/2}$ is a diagonal matrix whose elements are the reciprocals of the square 
roots of the elements of $\Lambda_x$. Matrix $W_o$ produces a set of vectors $y_i(t)$ that are 
orthogonal to each other; indeed, from (7), (9) and (10): 

$$W_o R_{xx} W_o^T = V_x^T V_x \Lambda_x V_x^T V_x = \Lambda_x \tag{13}$$

Vectors $y_i$ are thus mutually orthogonal, and their Euclidean norms are equal 
to the eigenvalues of the data covariance matrix. This is what PCA does [3]. By 
using matrix $W_w$, we obtain a set of orthogonal vectors of unit norms. From (7), 
(9) and (11), we have: 

$$W_w R_{xx} W_w^T = \Lambda_x^{-1/2} V_x^T V_x \Lambda_x V_x^T V_x \Lambda_x^{-1/2} = I_N \tag{14}$$

The orthogonal vectors $y_i(t)$ thus lie on a spherical surface (whitening, or 
Mahalanobis transform). Note that any whitening matrix can be multiplied from the 
left by an orthogonal matrix, and relation (14) still holds true. In particular, if 
we use matrix $W_s$ defined in (12), we have a whitening matrix with the further 
property of being symmetric. In [3], it is observed that application of matrix $W_s$ 
is equivalent to ICA when matrix A is symmetric. Our experimental work has 
consisted in applying the above matrices to typical images of ancient documents, 
assuming a linear model of type (2) to be valid, with the aim of emphasizing 
the main text in the whitened vectors and reducing the influence of background 
and bleed-through. 
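
The three matrices in (10)-(12) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `decorrelation_matrices` and `channels_from_rgb` are hypothetical, the covariance is the sample estimate after mean removal, and the covariance matrix is assumed to have full rank so that the inverse square roots exist.

```python
import numpy as np

def channels_from_rgb(image):
    """Reshape an (H, W, 3) color image into a (3, T) data matrix, T = H * W."""
    return np.asarray(image, dtype=float).reshape(-1, 3).T

def decorrelation_matrices(x):
    """Return (W_o, W_w, W_s) for data x of shape (N, T), one column per pixel."""
    x = x - x.mean(axis=1, keepdims=True)        # enforce the zero-mean assumption
    R = (x @ x.T) / x.shape[1]                   # sample covariance matrix
    eigvals, V = np.linalg.eigh(R)               # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to decreasing eigenvalues
    eigvals, V = eigvals[order], V[:, order]
    L = np.diag(1.0 / np.sqrt(eigvals))          # Lambda^{-1/2}; assumes full rank
    W_o = V.T                                    # PCA, eq. (10)
    W_w = L @ V.T                                # whitening, eq. (11)
    W_s = V @ L @ V.T                            # symmetric whitening, eq. (12)
    return W_o, W_w, W_s
```

Given a data matrix `x`, the output channels are obtained as `W_s @ (x - x.mean(axis=1, keepdims=True))` and reshaped back to the image size; which of the three output channels shows the cleanest main text still has to be selected by inspection.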

For each test image, the results are of course different for different whitening 
matrices. However, it is interesting to note that, for bleed-through reduction, 
the symmetric whitening matrix often performs better than ICA, which applies 
a further rotation to the output vectors, based on higher-order statistics. On the 
other hand, in some cases, whitening can also achieve a separation of the three 
different components, which is the final aim of ICA processing. 



4 Experimental Results 

In this section, we show some examples from our extensive experimentation on 
real degraded documents. From this set of experiments, we can already draw 
some general considerations. A quantitative assessment of the technique should 
rely on “ground truth” data, which of course are not available when treating 
real images. To assess the technique against synthetic images, these should be 
generated from a realistic mixture model, since evaluating the results obtained 
from a model that perfectly fits the data would be almost meaningless. Figure 1 
shows, in grayscale, an ancient manuscript affected by strong bleed-through, one 
output from symmetric whitening, and one output of the FastICA algorithm [6] 
[13], both obtained from the RGB components of the original color image. The 
PCA result is not shown, since its best output is similar to a graylevel version 
of the original; no text enhancement has been achieved. Conversely, the result 
from symmetric whitening shows an effective bleed-through reduction. Instead, 
in this case, FastICA failed to achieve a complete class separation, and was not 
able to produce one output with a clean main text pattern. As mentioned above, 
if both the data model is accurate and the image classes are mutually indepen- 
dent, FastICA processing should effectively separate the classes from the mixed 
RGB data. In this case, neither of the two hypotheses is verified, and symmetric 
orthogonalization apparently performs better than ICA. In Figure 2, we report 
an example where the three methods perform similarly, though the result of 
symmetric orthogonalization is slightly better than those of FastICA and PCA. 
However, FastICA produced one output with a clean main text pattern, even if 
it still failed to achieve a complete class separation. In Figure 3, the grayscale 
image of another ancient manuscript with strong bleed-through and one of the 
RGB orthogonalization outputs are shown. The apparent bleed-through 
suppression clearly improves the human legibility of the document. To show also 
how orthogonalization can improve the performance 
of OCR systems, in Figure 4 we report the results of applying the QIR optimal 
thresholding technique [12] to the grayscale document scan and to the orthogo- 

Fig. 1. From top to bottom: Scan of an ancient manuscript with bleed-through 
interference; Selected symmetric orthogonalization output from the RGB components of 
the color image; Selected FastICA output from the same data set. 

Fig. 2. Left: Scan of a manuscript with bleed-through interference. Middle: Selected 
symmetric orthogonalization output from the RGB components. Right: Selected 
FastICA output from the same data set. 









Fig. 3. Top: Scan of an ancient manuscript with bleed-through. Bottom: One of the 
symmetric orthogonalization outputs from the RGB components. 

Fig. 4. Top: QIR thresholding on the grayscale original in Fig. 3. Bottom: QIR 
thresholding on the orthogonalization output in Fig. 3. 





Fig. 5. Left: Histogram of the grayscale version of the original in Fig. 3. Right: 
Histogram of the orthogonalization output in Fig. 3. 



nalization output. As is seen, optimal thresholding has been much more effective 
on the orthogonalized image. A reason for this can be found by observing the 
histograms in Figure 5. The noticeable peak of the histogram of the graylevel 
original (on the left) at about level 150 is due to bleed-through. This peak does not 
appear in the histogram of the processed image, shown in the right-hand panel 
of Figure 5. The optimal threshold established by the QIR procedure is thus able 
to avoid artifacts due to bleed-through in the binarized image. Figure 6 shows an 
example where symmetric orthogonalization achieved full separation of the two 
overlapped texts, the main foreground text and the bleed-through pattern, which 

Fig. 6. Top: Scan of an ancient manuscript (from: 
http://www.site.uottawa.ca/~edubois/documents). Middle and Bottom: Main text and 
bleed-through patterns extracted through symmetric orthogonalization of the RGB channels. 



was not possible via optimal thresholding. As a last example, we show a case 
where effective bleed-through cancellation is accompanied by a sharpening 
of the main text pattern. In Figure 7, the grayscale original and the processed 
image of an ancient printed page are shown. In this case, bleed-through is weaker 
than in the cases presented above. It can be seen, however, that the main text 
appears more focused in the processed output than in the original. 

Fig. 7. Left: Scan of an ancient printed page with bleed-through. Right: One of its 
symmetric orthogonalization outputs. 



5 Conclusions 

We demonstrated a strategy that has proved to be effective in bleed-through 
cancellation from color or hyperspectral scans of degraded documents. The ad- 
vantage of this technique over many other strategies is that it is quite simple and 
fast, and it does not require reverse side scans or registration prior to 
cancellation. Our approach lacks a true theoretical justification; however, extensive 
experimentation has shown that one of the output channels from symmetric 
orthogonalization is always a more or less “clean” image of the main text pattern 
in the original document. Moreover, the application of other standard enhance- 
ment techniques is very effective if performed after orthogonalization. More gen- 
eral and quantitative results could be obtained from an experimentation with 
ground truth data available, that is, with synthetic images. However, a necessary 
step towards this goal is the development of an accurate numerical model for 
the bleed-through interference. 

Another necessary step towards a complete evaluation of this restoration 
strategy is to assess the potentiality of using more channels than the common 
RGB components from data collected by a color camera. In particular, it would 
be interesting to know which is the optimum number of data images. A viable 
approach to this problem could be the analysis of the spectra of the data covari- 
ance matrices [3] . However, we do not expect that the number of significant data 
images is independent of the type of document, and we already have evidence 
towards this conclusion. A complete assessment will be made as soon as a richer 
database of hyperspectral images becomes available to us. 

Abstract. This paper presents our work on color map classification for archive 
documents. An approach is proposed which is very similar to the way that humans 
perceive document colors, by using the diversity quantization algorithm applied 
within the HSV colour space. Template color maps are manually registered for 
sample document images; images from batches to be classified are then mapped 
onto the closest template color map using a fuzzy color classification algorithm. 
The results of testing this approach on batches of archive index cards from the UK 
Natural History Museum were very encouraging. We tested over 400 biological 
specimen index cards, and achieved more than 98% correct color classification. 



1 Introduction 

Archive document image conversion for online digital libraries is a major application 
area of interest both for cultural purposes and where historical data need to be compared 
(e.g. for biodiversity or climate change monitoring). We have developed a prototype 
user-assisted Archive Document Image Analysis System [1], which automatically clas- 
sifies the physical text contents of structured archive documents into a user-defined log- 
ical structure. The current system achieves about 93% correct classification of archive 
document text fields, but its performance is limited by the simple grey-level binarization 
algorithm used to separate foreground text from background card texture. While such 
algorithms are effective on clean modern printed documents, archive documents often 
show significant variations in background texture and colour due to ageing and fad- 
ing, and foreground text may be written, typed, stamped or printed with several colours 
in a single document. Color rather than grey-scale segmentation of archive documents 
not only allows more accurate segmentation of the document content into independent 
colour planes, but may also help in labelling each plane (e.g. as containing a species 
name, annotation, or validation stamp) for subsequent OCR and database entry. On the 
one hand, this improves the quality of the segmented texture; on the other, it reduces the 
complexity of subsequent analysis by partially pre-segmenting the different text fields 
of the document. 

To implement color segmentation, it is first necessary to obtain color maps that are 
representative of most of the archive document colors in a batch of documents. How- 
ever, both background and foreground colors on archive documents vary significantly. 
The variation can be classified into two aspects: intra-class color variation, where the 
color (e.g. red) of a document component varies because of variable illumination or 
age-related fading; and inter-class color variation, where different images may have 
different numbers of colored document components due to varying document content. 

S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 241-251, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 

If too many colors are identified for a batch of archive documents, or an improper color 
map is assigned to an image, this will increase both classification errors and 
computational complexity. Therefore, we propose a method that uses HSV color space modelled 
by fuzzy logic [2] [3] to represent colors, and the diversity color quantisation algorithm 
[4] to detect the proper color map of an image. 

The paper is organised as follows: Section 2 introduces modeling the HSV color 
space with fuzzy logic. Section 3 describes the color map template registration process. 
Section 4 describes how the color map for a sample document is identified by using 
the diversity color quantisation algorithm. Section 5 explains the classification process 
used for mapping the detected quantised color map onto the correct registered template. 
Section 6 describes our evaluation of the colour segmentation process and the results 
obtained. Section 7 draws conclusions. 

2 HSV Color Space Modelling with Fuzzy Logic [5] 

The HSV model is well-suited for mapping a linguistic representation of color, because 
of its similarities to the way humans tend to perceive color (e.g. grouping close colors 
together). It defines color space in terms of three parameters, hue (H), saturation (S) 
and value (V) [6]. Hue describes the basic color (e.g. red, blue or green) in terms of 
its angular position on a ’color wheel’ from 0 to 360 degrees. Saturation describes the 
purity of the color, ranging between 0 and 100%. The higher the value of saturation, 
the purer is the color; and as saturation decreases, the color turns to grey and eventually 
white. Value, often called intensity, describes the brightness of the color, and also ranges 
between 0 and 100%. With decreasing intensity, the color turns to black. 

2.1 Modeling with Fuzzy Logic 

In archive documents, small amounts of color dominate the image foreground. There- 
fore, it is sufficient to represent document colors by dividing the color space into rela- 
tively big regions. First of all, we convert the 3-dimensional model into 2-dimensions 
(HS color space) by simply dropping the V coordinate, except for black representation. 
This has two advantages. Firstly, the illumination distortion effects, (variation of bright- 
ness) caused by the document scanning process or simply by colors fading with age, 
are reduced because they mainly affect the intensity value. Secondly, the 2-dimensional 
color model is much easier to visualise. The conversion is carried out by dividing the 
intensity range 0-100 into two regions, “LOW” and “HIGH” as shown in Figure 1. Any 
color C(h,s,v) whose intensity value falls in the range “LOW” is considered as the lin- 
guistic color black (C(h,s,v) = black), otherwise it is in HS color space (in other words, 
its intensity value is dropped C(h,s) = C(h,s,v)). 

In 2-dimensions, HS color space can then be modelled, where hue (H) is equally 
divided into 12 regions as shown in figure 2 and each region corresponds to a linguis- 
tic color (e.g. red). Saturation (S) is divided into 3 regions, “LOW”, “MEDIUM” and 
“HIGH” as shown in figure 3. Any color C(h,s) with saturation “LOW” in HS space 
is considered to be the linguistic color white (C(h,s) = white); a color with saturation 
“MEDIUM” is considered to be desaturated; and a color with saturation “HIGH” is 
considered as pure. 

Fig. 1. Membership function for value (V) 

Fig. 2. Membership function for hue (H) 





Fig. 3. Membership function for saturation (S) 



Fig. 4. HS color palette 



The HSV color space, therefore, can be represented by a 2-dimensional HS-based 
color palette as shown in figure 4. The palette comprises four circles, each inside the 
other with the same centroid. The region inside the most central circle represents the 
color black. The ring region outside the central circle represents the color white. The 
remaining two ring regions are equally divided into 12 radial sections. Sections in the 
outermost ring region represent pure colors (e.g. red), while sections in the intermediate 
ring region represent desaturated colors (e.g. grey red). 

2.2 Color Reasoning 

Colors in HSV space are classified into specific regions on the defined palette by 
using a set of “IF-THEN” rules. First of all, we assign weights for each fuzzy set. For 
Hue, “RED” weighs 1, “YELLOW” weighs 3, and so on around the color wheel up to 12; for Saturation, 
“LOW” weighs -1, “MEDIUM” weighs 1 and “HIGH” weighs 13. Regardless of the 
values of Hue and Saturation, if the Intensity is classified as “LOW”, the color must 
be “BLACK”. Therefore, the weight assigned to Intensity “LOW” must override others 
when combining the HSV parameters; this is achieved by selecting weights of 0 for 
“LOW” and 1 for “HIGH” in the representation of Intensity. Then, the product 
operator is applied to calculate the overall weight $C_w(h, s, v) = C_w(h) \cdot C_w(s) \cdot C_w(v)$. Each 
region on the palette has a specific value of $C_w(h, s, v)$ as shown in figure 5. 

Fig. 5. Weight assignment on the fuzzy HSV palette 



Table 1. Partial Color Reasoning 

Hue     Weight  Saturation  Weight  Value  Weight  Color        Weight 
RED     1       HIGH        13      HIGH   1       RED          13 
RED     1       LOW         -1      LOW    0       BLACK        0 
RED     1       MEDIUM      1       HIGH   1       GREY RED     1 
RED     1       LOW         -1      HIGH   1       WHITE        -1 (<0) 
YELLOW  3       HIGH        13      HIGH   1       YELLOW       39 
YELLOW  3       HIGH        13      LOW    0       BLACK        0 
YELLOW  3       MEDIUM      1       HIGH   1       GREY YELLOW  3 
YELLOW  3       LOW         -1      HIGH   1       WHITE        -3 (<0) 



For the BLACK region, the value of $C_w(h, s, v)$ is always 0; for the WHITE region, 
the value of $C_w(h, s, v)$ is always negative; for other regions, the value of $C_w(h, s, v)$ is 
unique. The required “IF-THEN” rules for color reasoning can then be systematically 
derived from the weights on the color wheel, e.g. if hue C(h) is “RED” and saturation 
C(s) is “HIGH” and value C(v) is “HIGH”, then the color C(h, s, v) is “RED”. If 
hue C(h) is “RED” and saturation C(s) is “HIGH” and value C(v) is “LOW”, then 
the color C(h, s, v) is “BLACK”. Table 1 shows some examples of color reasoning. In 
sections 3 and 4, reasoned colors are used to represent the image color map. 
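
The weight product can be illustrated with a small sketch. It assumes crisp labels (the fuzzy membership functions of Figures 1 to 3 are not reproduced here) and a hue index running from 1 to 12 around the wheel, with RED = 1 and YELLOW = 3 as stated above; the function names `overall_weight` and `decode` are hypothetical.

```python
# Weights quoted in the text; hue weights are assumed to be the sector index 1..12.
SAT_WEIGHT = {"LOW": -1, "MEDIUM": 1, "HIGH": 13}
VAL_WEIGHT = {"LOW": 0, "HIGH": 1}

def overall_weight(hue_index, sat_label, val_label):
    """Product of the hue, saturation and value weights for one palette region."""
    if not 1 <= hue_index <= 12:
        raise ValueError("hue_index must be in 1..12")
    return hue_index * SAT_WEIGHT[sat_label] * VAL_WEIGHT[val_label]

def decode(weight):
    """Map an overall weight back to (region kind, hue index or None)."""
    if weight == 0:
        return ("BLACK", None)          # value LOW overrides everything else
    if weight < 0:
        return ("WHITE", None)          # saturation LOW with value HIGH
    if weight % 13 == 0:
        return ("PURE", weight // 13)   # saturation HIGH: weights 13, 26, ..., 156
    return ("DESATURATED", weight)      # saturation MEDIUM: weights 1..12
```

Because the desaturated regions take the weights 1 to 12 and the pure regions take multiples of 13, the 24 colored regions never collide, which is what makes the weight a usable region identifier.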

3 Color Map Registration 

Sample color patches are selected from representative sample images (figure 6(a)(b)) 
by rubber-banding suitable areas using a user interface. As background color varies 
significantly image by image in archive documents, an exclusive representation for 
background is difficult to determine. Therefore, only the foreground color pixels of 
the selected sample patches (representing handwriting, typescript or stamps) are useful 
in constructing candidate template color maps. The marking mechanism also links the 
selected foreground color areas to relevant document analysis functions (e.g. text layout 
analysis, stamp detection and elimination). 

Because some foreground areas are thin or narrow, the corresponding foreground 
samples will probably contain background as well as foreground pixels, and therefore 
require further processing to distinguish which pixels represent the distinctive 
foreground. Each pixel’s color is represented using three 16-bit values in RGB color space, 
so the color histogram of each patch is initially very flat, with individual pixels rep- 
resenting subtly different colors. This makes it difficult to find a single representative 
color value to represent the whole patch. Therefore, we quantize each pixel’s color in 
each patch linearly into 5x5x5 = 125 unique colors by reducing the original 16 bit 
color representation to only 5 bits per color. This has the effect of clustering similar 
colors into a single common color value. Then, we convert the quantized colors into 
HSV space [7], and select the modal color as the representative color for that patch. 
The representative color for each patch defines the template image foreground color 
maps (figure 7(a)(b)) (which are represented by the fuzzy HSV color space modeled in 
Section 2), and are used as a mask for segmentation of images’ foreground into color 
planes (each of which is subsequently processed using one or more linked document 
analysis functions). In the examples of figure 6(a) and (b), (a) contains two different 
foreground colors, “RED” and “BLACK”, while (b) only has one foreground color, 
“BLACK” (which includes both typed data fields and handwritten annotations). 
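
The patch processing described above can be sketched as follows. The sketch assumes five equal-width levels per 16-bit channel, which is what yields 5x5x5 = 125 colors; the function names `quantize_linear` and `modal_color` are hypothetical, and the subsequent conversion of the modal color to HSV (for instance with the standard-library `colorsys.rgb_to_hsv`) is left out.

```python
import numpy as np

def quantize_linear(rgb16, levels=5):
    """Map 16-bit channel values (0..65535) linearly onto bins 0..levels-1."""
    rgb16 = np.asarray(rgb16, dtype=np.uint32)   # widen to avoid overflow in the product
    return (rgb16 * levels) // 65536

def modal_color(quantized):
    """Return the most frequent quantized (r, g, b) triple of a patch."""
    flat = quantized.reshape(-1, 3)
    uniq, counts = np.unique(flat, axis=0, return_counts=True)
    return tuple(int(c) for c in uniq[np.argmax(counts)])

# A tiny 2 x 2 patch: three reddish pixels and one blue pixel.
patch = np.array(
    [[[65535, 0, 0], [60000, 100, 200]],
     [[65000, 50, 10], [0, 0, 65535]]],
    dtype=np.uint16,
)
representative = modal_color(quantize_linear(patch))
```

The three reddish pixels fall into the same bin, so the modal color of the patch is the quantized red even though no two of them share exactly the same original value.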



4 Color Map Detection 

Each input image in the batch from which figure 6(a) and (b) have been selected as 
templates is first quantized linearly into 5x5x5 = 125 unique colors (as for the color 
map registration) to reduce the number of colors represented by its pixels. Any remain- 
ing colors which have less than 100 pixels are then discarded, as such small amounts of 
colour are assumed to represent noise. The remaining colors are then further quantized 
down to six distinct colors to minimise computational requirements using the diversity 
algorithm. This generates the color maps shown in figures 8(a) and (b). 

Fig. 6. Biological specimen index card (a) with two foreground colors, (b) with one foreground 
color 

Fig. 7. Registered color map (a) with two foreground colors, (b) with one foreground color 

The diversity algorithm runs a histogram on the entire input image, picks the color 
with the highest pixel count (normally the background color), and then iteratively finds 
five additional colors in the unpicked list that are furthest (in HSV Euclidean distance 
[7]) from already picked colors. All remaining pixels are then added to the nearest 
color cluster. Other quantization algorithms such as Median Cut [8], Sequential Scalar 
[9] and Lloyd [10] can perform a similar color clustering function, but the Diversity 
algorithm is the most suitable here, because it identifies the potential background color 
by picking the color with the highest pixel count as its starting point. Although the 
Popularity algorithm [8] has a similar capability to select the background color, it 
may miss distinct foreground colors with small numbers of pixels when the number 
of quantized colors is limited. The way the Diversity algorithm picks colors is similar 
to humans observing colors on documents, where foreground colors (e.g. text) which 
are distinctively different from each other stand out, whereas similar colors tend to be 
perceived as part of the same document component. 
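
A minimal sketch of the picking step is given below. It assumes that “furthest from already picked colors” means maximizing the distance to the nearest picked color, and it uses a plain Euclidean distance on the three color coordinates without handling hue wrap-around; both points, like the function name `diversity_pick`, are assumptions made for illustration.

```python
import numpy as np

def diversity_pick(colors, counts, n_colors=6):
    """Pick representative colors; return (picked indices, cluster label per color)."""
    colors = np.asarray(colors, dtype=float)
    counts = np.asarray(counts)
    picked = [int(np.argmax(counts))]            # most frequent color, usually background
    # Distance from every color to its nearest picked color so far.
    min_dist = np.linalg.norm(colors - colors[picked[0]], axis=1)
    while len(picked) < min(n_colors, len(colors)):
        nxt = int(np.argmax(min_dist))
        if min_dist[nxt] == 0.0:                 # only duplicates of picked colors remain
            break
        picked.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(colors - colors[nxt], axis=1))
    # Assign every color to the nearest picked representative.
    d = np.linalg.norm(colors[:, None, :] - colors[picked][None, :, :], axis=2)
    labels = np.argmin(d, axis=1)
    return picked, labels
```

Unlike a popularity-based choice, a color with very few pixels is still picked as long as it lies far from everything already selected, which is the property the text relies on for small foreground regions.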

Evaluation of our archive documents has shown that no more than 4 logically dis- 
tinct colors are found on a single image. Therefore, 6 quantized image colors should be 
enough to include all distinct foreground and background colors. In general however, 
the number of colors in the quantized color map CMAPq should be designed always 
to be more than that in the registered CMAPr. The next section shows how to map the 
sample image’s quantized color map onto each of the template colour maps, and hence 
determine which template the sample document image matches. 

5 Mapping 

In this section, the sample image color map CMAPq generated by the diversity 
algorithm is matched against each registered colour map CMAPr using three steps: 
background mapping, foreground mapping and color map refinement. 

5.1 Background Mapping 

Even though no background color is recorded in CMAPr, it must still be identified 
from CMAPq, since the background color will eventually be used for color 
segmentation. The diversity quantization algorithm provides an easy way to identify it, as it is 
always labelled first due to its dominant contribution to the overall pixel count. 
Therefore, the background color can immediately be identified, and the foreground colors are 
chosen from the remaining 5 unidentified colors. 

Fig. 8. Quantized color map (a) and (b) with six colors 



5.2 Foreground Mapping 

The foreground color mapping is straightforward. Each color in CMAPr is compared 
one by one with those in CMAPq. Each match scores 1 (if a color in CMAPr matches 
twice or more in CMAPq, it still scores 1). If the total score is equal to the number of 
colors in CMAPr, CMAPq is considered to have the same color map as CMAPr. If 
CMAPq can’t be mapped onto any registered CMAPr, the sample image is rejected. 
If CMAPq can be mapped onto more than one CMAPr, the one with the highest 
score is chosen. For instance, color map figure 8(a) can be mapped onto both color 
map figure 7(a) and (b). The former scores 2 and the latter 1. Therefore, the former is 
chosen. The foreground colors are therefore identified in CMAPq. However, two more 
complex cases need to be considered. First, there are some cases where unidentified 
colors remain in CMAPq after the foreground mapping process. Conversely, more than 
one color may appear within the same color class (e.g. two colors both classified as 
“BLACK”) in CMAPq. In both cases, it is necessary to decide which unmatched colors 
are useful and should be kept, and which of them are useless and should be discarded. 
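
The scoring rule can be sketched as a set intersection. The sketch assumes that each color map is given as a collection of linguistic color labels and that the background has already been removed from CMAPq; the function name `match_template` and the template names used in the usage example are hypothetical.

```python
def match_template(cmap_q, templates):
    """Return (template name, score) of the best fully matched template, or (None, 0)."""
    q = set(cmap_q)                          # repeated colors in CMAPq count once
    best_name, best_score = None, 0
    for name, cmap_r in templates.items():
        r = set(cmap_r)
        score = len(r & q)                   # one point per template color found in CMAPq
        # A template matches only if every one of its colors is found.
        if score == len(r) and score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

templates = {"A": {"BLACK", "RED"}, "B": {"BLACK"}}
result = match_template(["BLACK", "RED", "BLACK", "GREY RED", "WHITE"], templates)
```

In this usage example both templates match, with scores 2 and 1, so template "A" is chosen; a sample whose colors match no template completely returns `(None, 0)` and would be rejected.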



5.3 Color Map Refinement 

The purpose of the refinement is to keep a reasonable number of colors (apart from those 
identified foreground colors which are already matched) in CMAPq for later color 
segmentation. The presence of too many colors may reduce the cohesion of the texture 
on each segmented color plane and increase the cost of computation. The absence of 
enough colors may result in undesirable segmentation (e.g. part of the background may 
be lost). Therefore, three rules are defined to refine the color map CMAPq. 






J. He and A.C. Downton 



Rule 1: if more than one unidentified color or more than one identified foreground color 
have the same color class, these colours are merged. 

Rule 2: if any unidentified color is next to an identified color (either background or 
foreground) on the defined palette, it is merged with the adjacent colour. 

Rule 3: after applying Rule 1 and 2, if any unidentified color is still present, it is marked 
as background. 

Figures 9(a) and (b) show the final color maps after refinement. In the final stage of color 
segmentation, even though each input image may be segmented into a different total 
number of color planes (e.g. more than one background plane may appear, as in figure 
9(b)), the number of foreground planes will always be identical to that of the relevant 
CMAPr. Additional background planes have no impact on subsequent analysis, as they 
are eventually discarded. 




Fig. 9. Refined color map (a) with three colors, (b) with four colors 



6 Evaluation and Result 

6.1 Evaluation 

Evaluation of the color segmentation algorithm was carried out on a set of 400 sample 
biological archive index cards from the UK Natural History Museum, for which two 
color maps are registered as shown in figure 7(a) and (b). None of the test cards was 
found to contain more than 3 foreground colors. 187 of them are similar to Figure 10(a) 
(Type I), and have only one foreground color “BLACK”. 213 of them are similar 
to Figure 10(b) (Type II), and have two foreground colors “BLACK” and “RED”. 
The background colors of samples of both types vary significantly because of color 
fading due to age and the use of different cards over time. Their foreground colors 
also vary because of fading and illumination variation. The texture with color “RED” 
includes machine-typed text (genus names), handwritten annotations, lines and stamps. 
The texture with color “BLACK” includes machine-typed text (species names and other 
data fields), handwritten annotations and stamps. 
Fig. 10. Biological specimen index card (a) with one foreground color “BLACK”, (b) with two 
foreground colors “BLACK” and “RED” 

Table 2. Result of evaluation 

Image Type             Color Map A (Fig. 7(a))   Color Map B (Fig. 7(b))   Rejection 
Type I                 5/178 (2.8%)              173/178 (97.2%)           0% 
Type II                211/213 (99.1%)           2/213 (0.9%)              0% 
Overall correct rate   98.2% 



6.2 Results 

The results shown in Table 2 are very encouraging. 97.2% of Type I images have been 
assigned the correct Type I color map, while 99.1% of Type II images have been 
assigned the correct Type II color map. Figure 11 shows the background and two 
foreground color planes resulting from correct classification of the Type II sample image 
shown in Figure 10(b). The small number of Type I images (figure 12(a)) with a wrongly 
assigned Type II color map result from unexpected, widely distributed “RED” pixels 
(figure 12(b)). These noise pixels may be caused by mixture of the foreground and 
background colors at the time of typing. The two classification errors for Type II images 
(figure 13) were caused by insufficient “RED” pixels in the image; as a result, the 
small number of “RED” pixels (caused by there being only a few “RED” characters 
with very thin strokes) were discarded during the initial color quantization stage. In 
both cases, the errors could be eliminated by refining the thresholding of the color map 
detection algorithm, which currently discards any color with fewer than 100 pixels. The 
required additional processing would be to set a higher discard threshold, but then to 
evaluate the positional variance of pixels within each quantized color below this 
threshold, and only retain colors with a small positional variance, corresponding to 
correlated foreground print rather than distributed noise. In the case of the Type I images 
wrongly classified as Type II above, the “RED” pixels would then be below the increased 
discard threshold level, but are evenly distributed over the image, so would be discarded. 
In the case of the Type II images wrongly classified as Type I, the small number of “RED” 
pixels (previously discarded) would now be retained as a legitimate color because they 
would be spatially closely correlated. 
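
The proposed refinement can be sketched as follows. This is only an illustration of the idea, not the implemented algorithm: the function names and both threshold values are hypothetical, and positional variance is taken here as the mean squared distance of the pixels to their centroid.

```python
def positional_variance(points):
    """Mean squared distance of (x, y) pixel positions to their centroid."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return sum((x - mx) ** 2 + (y - my) ** 2 for x, y in points) / n

def keep_color(points, count_threshold, variance_threshold):
    """Decide whether a quantized color is retained.

    Colors with enough pixels are always kept; colors below the count
    threshold are kept only if their pixels are spatially concentrated.
    Both thresholds are assumed values that would need tuning.
    """
    if len(points) >= count_threshold:
        return True
    return positional_variance(points) <= variance_threshold
```

With such a test, a handful of “RED” pixels forming one thin-stroked word would be kept, whereas the same number of pixels scattered across the card would be rejected.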
Fig. 11. Successfully segmented color planes (a) background (b) foreground 1 (text in BLACK ink) 
(c) foreground 2 (stamp and line in RED ink) 
Fig. 12. Biological specimen index card (a) with one foreground color “BLACK”; (b) distributed 
“RED” pixels (enlarged) causing erroneous classification 



7 Conclusion 

In this paper, the use of diversity quantization of the HSV color space has been proposed 
to classify colors on archive document images, so as to overcome color variation over 
different images in both the foreground and background colors. The diversity color 
quantization algorithm combined with a fuzzy color classification algorithm is applied 
to detect a primary color map for each sample image. The primary color map is 
compared with those of pre-registered color templates to determine a correctly-labeled color 
map for subsequent color segmentation and other document analysis. Initial results of 
experimental evaluation are very encouraging, with over 98% of test images assigned 
the correct color map. Small refinements to the algorithm are proposed to resolve 
remaining color classification errors. 

Fig. 13. Biological specimen index card with two foreground colors “BLACK” and “RED”, 
which only represent a small notation 
Abstract. This paper introduces an adaptative segmentation system that was 
designed for color document image analysis. The method is based on the 
serialization of a k-means algorithm that is applied sequentially by using a sliding window 
over the image. During the window’s displacement, the algorithm reuses informa- 
tion from the clusters computed in the previous window and automatically adjusts 
them in order to adapt the classifier to any new local variation of the colors. To 
improve the results, we propose to define several different clusters in the color 
feature space for each logical class. We also reintroduce the user into the initial- 
ization step to define the number of classes and the different samples for each 
class. This method has been tested successfully on ancient color manuscripts, 
video images and multiple natural and non-natural images having heavy defects 
and showing illumination variation and transparency. The proposed algorithm is 
generic enough to be applied on a large variety of images for different purposes 
such as color image segmentation as well as binarization. 



1 Introduction 

Every year, large numbers of manuscripts are digitized around the world to preserve cultural 
legacy. Today, some of these images are available on the Internet in online digital 
libraries. Most of them are manually indexed at great cost, so the automation of this 
task has become a necessity. Several indexing systems have already been proposed for 
document information retrieval using image analysis tools. Most of them are accurate 
when applied to printed document archives, but very few propose a solution for indexing 
ancient manuscripts. The difficulty of automatically analyzing digitized manuscripts is 
related to the complexity of the segmentation. This is due to the bad quality of the 
document images and the complexity of their contents. Ancient manuscripts have a very 
rich content [1] and require a multicolor segmentation to extract colored text, 
illuminated marks, etc. 

Generic algorithms for color image segmentation are difficult to apply to document 
images for many reasons. Firstly, documents are generally digitized at high 
resolution, which produces large digital images that slow down classical segmentation 
algorithms. This makes it difficult to achieve a global segmentation of the entire image 
without memory overloads and excessive computing time. Secondly, images 
of digitized documents have particular defects that make the use of classical 
segmentation algorithms difficult. Figure 1 shows the defects we usually find in color 
images of ancient manuscripts, such as stains, holes, humidity marks, degradation of the ink, 
character strokes, color variation of the paper, and recto/verso transparency. 

In most of the image processing tools described in the literature, some restoration 
might be done ahead of the segmentation. The proposed algorithm can be used to 
restore and/or segment images. 

S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 252-263, 2004. 
Fig. 1. Samples of color images of ancient manuscripts. 



2 Proposition 

2.1 Serialization of the k-Means Algorithm 

Document image processing is generally better achieved by adaptative segmentation 
methods. It has been previously shown that adaptative thresholding algorithms, such 
as Niblack’s or Sauvola’s [9], perform better than non-adaptative ones like Otsu’s or 
Fisher’s. It is explained by the specific defects found in document images, such as 
stains, illumination variation and color fading which need to be processed by a highly 
adaptative algorithm. It is generally achieved by considering the local information in 
the neighborhood centered on each pixel using a sliding window. 

The principle of our method is to apply an unsupervised classifier, like the k-means 
algorithm, on a sliding window so that the segmentation is neighbor-dependent and 
allows the classifier to slightly adapt the clusters to any new local color information. We 
chose the k-means algorithm because it is the simplest and most efficient unsupervised 
classifier that can be easily modified and controlled, but other classifiers can be used. 

In our algorithm, each point P of the image is labelled by a k-means classifier that 
is initialized with the clusters’ centers of the previous classification and trained on a 
window $w_P$ centered on P. The dependence between successive classifications is 
justified by the fact that most of the information is similar in overlapping windows. The 
serialization of the k-means has three properties: 

- A significant reduction in the number of iterations needed to stabilize the centers. 

- A higher adaptativity of the segmentation, as the center of each cluster can move 
swiftly. 

- Not every cluster has to be represented in each window. 

We choose to serialize the k-means along the x-axis because most of our images, 
which show open books, are wider than they are high. Further tests show that 
adaptation along the y-axis does not provide better results and may require extra 
computation time. At the beginning of each scanline, the centers of the color clusters are 
initialized with the original centers provided by the user (see section 2.3); then, along the 
scanline, we ensure that the centers do not swap during the adaptation (see section 2.4). 
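
The serialization along a scanline can be illustrated with a deliberately reduced sketch. It is an assumption-laden toy, not the system described here: features are scalar gray values instead of color vectors, the window is one-dimensional, and the names `kmeans`, `serialized_scanline` and `half_width` are hypothetical. The swap prevention of section 2.4 is omitted.

```python
def kmeans(samples, centers, max_iter=20):
    """Plain k-means on scalar samples, warm-started from `centers`.

    A cluster that receives no sample keeps its previous center, so not
    every cluster needs to be represented in the window.
    """
    centers = list(centers)
    for _ in range(max_iter):
        sums = [0.0] * len(centers)
        counts = [0] * len(centers)
        for s in samples:
            i = min(range(len(centers)), key=lambda j: abs(s - centers[j]))
            sums[i] += s
            counts[i] += 1
        new = [sums[i] / counts[i] if counts[i] else centers[i]
               for i in range(len(centers))]
        if new == centers:      # assignments are stable
            break
        centers = new
    return centers

def serialized_scanline(row, init_centers, half_width=3):
    """Label one scanline, reusing the centers of the previous window."""
    centers = list(init_centers)          # reset at the start of the scanline
    labels = []
    for x in range(len(row)):
        window = row[max(0, x - half_width): x + half_width + 1]
        centers = kmeans(window, centers)  # warm start from previous window
        labels.append(min(range(len(centers)),
                          key=lambda j: abs(row[x] - centers[j])))
    return labels
```

In this toy setting the background center follows a slowly fading paper tone from window to window, which is the adaptation the serialization is meant to provide.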



2.2 Choice of the Features 

Our serialized k-means is independent from the choice of the features and the 
dimension of the feature space. Experiments show that we need the maximal amount of 
information available to separate, for example, some verso characters that are visible by 
transparency from faded recto characters that have approximately the same luminosity 
and hue. 

At first we applied the classification on the RGB channels as our images are color 
scans and some parts of the text are colored [6]. 

On some images, using the YUV¹ coordinate system, which is a linear 
combination of RGB, can slightly improve the segmentation accuracy. 

The HSL colorspace (Hue, Saturation, Luminosity) is non-linearly calculated from 
RGB. The Hue channel is the angle of the color on the chromatic circle; therefore we 
implemented the circular distance and circular mean to use this feature adequately. The 
HSL channels are highly relevant, and do boost the segmentation accuracy [5]. 
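
As an illustration of why Hue needs special handling, the following sketch shows one common way to compute a circular distance and a circular mean for angles in degrees. It is not taken from the described implementation; the function names and the degree convention are assumptions.

```python
import math

def circular_distance(h1, h2):
    """Shortest angular distance between two hues, in degrees (0..180)."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def circular_mean(hues):
    """Mean direction of a list of hues in degrees, in [0, 360]."""
    s = sum(math.sin(math.radians(h)) for h in hues)
    c = sum(math.cos(math.radians(h)) for h in hues)
    return math.degrees(math.atan2(s, c)) % 360.0
```

For two reddish hues at 350° and 10°, the arithmetic mean would be 180° (cyan), whereas the circular mean stays at the red end of the circle and their circular distance is 20° rather than 340°.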

We also use the standard deviation of the grayscale cooccurrence matrix calculated 
in sliding windows. This feature is generally not suited for ancient manuscript images 
but can improve the results on natural or dithered images. 

Other features such as Euclidean colorspaces like L*u*v* [7] or more texture 
information [8] can be used, since our classifier is designed to be used with feature vectors 
whose dimension is not preset. 

For general-purpose use we take both the HSL and RGB channels, thus having a 
6-dimensional feature space. 

¹ the PAL television luminance/chrominance 
2.3 Initialization of the Algorithm 

As our document images are noisy and unequally illuminated, some pixels of a same 
class located at different parts of an image may be very different. Furthermore, some 
classes may have very tight borders (e.g. verso text versus deteriorated recto text), and 
we need to describe those borders quite precisely, that is to define several clusters for 
each logical class. Therefore, we use a hierarchical classification: meta-classes that are 
made of several clusters in the feature space [2, 3]. 

Generally in color document images, color clusters are not Gaussian [4], thus it 
is difficult to determine the number of clusters that is required. This explains why the 
number of logical classes, the number of clusters for each class and their initial centers 
must be provided by the user. 

In our application, a user interface allows us to set the number of logical classes 
and to select the multiple color samples for each logical class that initialize the original 
centers of the clusters, which we will call $\{C_i^{0}\}_{i \in [1 \ldots k]}$. Those color samples are the 
means of the pixels in rectangles designated in the image by the user. This initialization 
is done only once for all the images from the same book. The number k of clusters is equal 
to the number of color samples and not to the number M of logical classes. We assume that 
the user provides at least one color sample for each logical class ($k \geq M$). Figure 2a 
shows all the colors from the image in figure 4a and the centers of the clusters in the 
three-dimensional RGB space. There are three classes (M = 3) in this image, described 
with different numbers of clusters ($\omega_1$ is red text with 1 center, $\omega_2$ is black text with 2 
centers and $\omega_3$ is the background and transparent verso with 5 centers). 
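
A minimal sketch of this initialization is given below, assuming the image is a nested list of color tuples and each user sample is a pair (class name, rectangle). The data layout and the names `rectangle_mean` and `init_clusters` are hypothetical; they only illustrate that each rectangle yields one cluster center and that several clusters may share one logical class.

```python
def rectangle_mean(image, rect):
    """Mean feature vector of the pixels inside rect = (x0, y0, x1, y1).

    x1 and y1 are exclusive bounds; image[y][x] is a tuple of channels.
    """
    x0, y0, x1, y1 = rect
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    channels = len(pixels[0])
    return tuple(sum(p[c] for p in pixels) / len(pixels)
                 for c in range(channels))

def init_clusters(image, samples):
    """samples: list of (logical_class, rect). Returns k centers and,
    for each of the k clusters, the logical class it belongs to."""
    centers = [rectangle_mean(image, rect) for _, rect in samples]
    class_of_cluster = [cls for cls, _ in samples]
    return centers, class_of_cluster
```

The returned class list is what allows k clusters to be folded back into M logical classes once each pixel has been assigned to a cluster.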




Fig. 2. a) RGB space and the clusters’ centers, b) Displacement of the clusters’ centers along a 
scanline. 
2.4 Class Swapping Prevention 



An automatic unsupervised classifier like the k-means algorithm can easily swap the 
cluster centers. If serialized, this creates inversions between the clusters and class labels. 
We prevent the centers from swapping after their stabilization in order 
to guarantee the convergence of the classification and the relevance of the labelling. 

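To do that, for each window $w_P$, after the training of the k-means and before the 
labelling of P, we compute the distance between each new cluster center $C_i^{w_P}$ and a set 
of reference centers $\{C_j^{\mathrm{ref}}\}_{j \in [1 \ldots k]}$. If the nearest reference center is not the one with 
the same label, then $C_i^{w_P}$ is moved (back) to the reference center $C_i^{\mathrm{ref}}$: 

$$\forall i \in [1 \ldots k],\quad \arg\min_{j} d\left(C_i^{w_P}, C_j^{\mathrm{ref}}\right) \neq i \;\Rightarrow\; C_i^{w_P} = C_i^{\mathrm{ref}}$$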
The reference centers, which are re-calculated for each window, are linear² 
combinations of the previous centers $\{C_i^{w_{P-1}}\}_{i \in [1 \ldots k]}$ and the original centers 
$\{C_i^{0}\}_{i \in [1 \ldots k]}$ that were selected by the user in the initialization step. 

$$\forall i \in [1 \ldots k],\quad C_i^{\mathrm{ref}} = (1 - \lambda) \cdot C_i^{w_{P-1}} + \lambda \cdot C_i^{0}, \qquad \lambda \in [0, 1]$$

The best results are achieved with $0 < \lambda < 0.7$. 

Such a limitation of the displacement of the centers around the original centers does not 
restrain the adaptation of the clusters. Figure 2b shows the adaptation of the centers of the 
clusters in the RGB space along a scanline of the image given in figure 4a with reference 
centers equal to the previous centers. 
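
The swap prevention can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: centers are plain tuples, the distance is Euclidean, the weight `lam` is applied to the original centers and `1 - lam` to the previous ones, and the function names are hypothetical.

```python
import math

def reference_centers(previous, original, lam):
    """Blend previous and original centers; lam = 0 gives the previous centers."""
    return [tuple((1.0 - lam) * p + lam * o for p, o in zip(pc, oc))
            for pc, oc in zip(previous, original)]

def prevent_swapping(new_centers, refs):
    """Move a center back to its own reference when another reference is nearer."""
    fixed = []
    for i, c in enumerate(new_centers):
        nearest = min(range(len(refs)), key=lambda j: math.dist(c, refs[j]))
        fixed.append(c if nearest == i else refs[i])
    return fixed
```

With `lam = 0` the references simply follow the previous window, so the centers may drift freely along the scanline; larger values tie them more strongly to the samples chosen by the user.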



2.5 Labelling of the Pixels 

In the normal case, a point P is assigned the label of the nearest cluster center. This is 
troublesome when dealing with dithered images because an apparent color can be the 
result of the juxtaposition of two different colors. Let us consider an image with two 
colors, $c_1$ and $c_2$, and one level of dithering, which we try to segment into three 
classes. As our algorithm initializes the centers with the mean feature vector calculated 
in windows given by the user, the centers will be $C_1$ of color $c_1$, $C_2$ of color $c_2$ and 
$C_3$ of color $\mu_f(c_1, c_2)$, with $\mu_f$ the mean function suited to the feature vector the 
user chose. If we do not pay attention, no pixel will be classified in the cluster $C_3$, as the 
only pixels that are present in the image are of color $c_1$ or $c_2$. This is a problem that no 
common classifier, even a serialized and adaptative one, can overcome. 

Assuming that, in a given window, $C_3$ has been assigned no pixel, if the clusters 
$C_1$ and $C_2$ have approximately the same number of samples, we calculate the spatial 
barycenters³ $B_1$ and $B_2$ of the pixels of the window that are of color $c_1$ and $c_2$ (as shown 
in figure 3). From this, we can conclude that: 

1. if $B_1$ and $B_2$ are distinct, the window is centered on the edge of two uniform regions 
of color $c_1$ and $c_2$, 

2. if $B_1$ is very close to $B_2$, the window is centered on a dithered part of the image or 
the window is centered on a particular kind of symmetric pattern. Here, a frequency 
analysis can lead to a more precise separation. 

² or circular for the Hue feature 
³ i.e. the centroids 
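
The barycenter test can be illustrated by the sketch below. It only covers the detection step, not the subsequent labelling; the window is assumed to be a 2-D list of cluster indices, and the tolerances `eps` and `ratio_tol` are hypothetical values standing in for "very close" and "approximately the same number of samples".

```python
import math

def barycenter(points):
    """Centroid of a list of (x, y) positions."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def looks_dithered(labels, a, b, eps=1.0, ratio_tol=0.25):
    """True when clusters a and b are equally populated in the window
    and their spatial barycenters nearly coincide."""
    pa = [(x, y) for y, row in enumerate(labels)
          for x, lab in enumerate(row) if lab == a]
    pb = [(x, y) for y, row in enumerate(labels)
          for x, lab in enumerate(row) if lab == b]
    if not pa or not pb:
        return False
    if abs(len(pa) / len(pb) - 1.0) > ratio_tol:
        return False
    return math.dist(barycenter(pa), barycenter(pb)) < eps
```

A checkerboard window yields two coincident barycenters, while a window straddling a vertical edge yields barycenters that are half a window apart, so the two cases are separated by the distance test alone.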

In case 2, we decide to apply a specific algorithm to manage dithered images. This can 
lead to the misclassification of some symmetric patterns but it is a lesser harm than having 
$C_3$ attributed to no pixel. 

Generally a smoothing algorithm is used to process dithered images, but too much 
smoothing destroys the thin strokes of the characters and smears close patterns. 

Therefore we adapt the classification in case 2 by choosing the label of a pixel with 
the following criterion: 

$$l = \arg\min_{i} \left\{ d\left(f(P), C_i^{w_P}\right),\; d\left(f(\mu_\sigma(w_P)), C_i^{w_P}\right) \right\}$$

where $l$ is the label given to the pixel P, $\{C_i^{w_P}\}$ the cluster centers for the window $w_P$, 
$f$ the function we apply to a pixel to obtain its feature vector and $\mu_\sigma(w_P)$ the Gaussian 
mean of $w_P$ with a standard deviation $\sigma$ and centered on P. 

The $\sigma$ parameter makes it possible to overcome very local stains by smoothing the 
classification of the pixels without smoothing the image. Images with noise or dithering are to 
be processed with $0.7 < \sigma < 1.4$ to smooth the classification, whereas clean document 
images must be processed with no smoothing ($\sigma < 0.5$). 

Another parameter, $\rho$, makes it possible to control the serialization of the algorithm. It defines 
a distance between a pixel and its closest center beyond which the pixel will not be used 
to calculate the new center. Therefore, if $\rho = 0$, the serialization is blocked and the 
algorithm is just a windowed k-means. If $\rho \to \infty$, the algorithm is fully serialized. 

It is useful to control the serialization when processing dithered images, for in that 
case, if fully serialized, the algorithm would be able to adapt to each color of the 
dithering and the result would also be dithered. Generally, for dithered images, we use $\sigma > 0.5$ 
and $\rho = 0$ (no serialization) and for shaded images having illumination variations, we 
use $\sigma < 0.5$ and $\rho \to \infty$ (fully serialized). 




Fig. 3. Spatial barycenters of a dithered image and a shaded image. 
2.6 Summary 



Let us call $f$ the function we apply to a point P to obtain its feature vector. For each $w_P$ 
centered on P, we first calculate the new cluster centers. This is a normal k-means 
process to which we add the $\rho$ condition: 

$$\forall i \in [1 \ldots k],\quad C_i^{w_P} = \mu \left\{ f(Q),\; Q \in w_P \;/\; \arg\min_{j} d\left(f(Q), C_j^{w_{P-1}}\right) = i \;\wedge\; d\left(f(Q), C_i^{w_{P-1}}\right) < \rho \right\}$$

where $C_j^{w_{P-1}}$ is the center of the cluster number j from the previous window. 

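Then, we prevent the centers from swapping by sticking them to the reference centers 
if they moved too far: 

$$\forall i \in [1 \ldots k],\quad C_i^{\mathrm{ref}} = (1 - \lambda) \cdot C_i^{w_{P-1}} + \lambda \cdot C_i^{0}, \qquad \lambda \in [0, 1]$$

$$\forall i \in [1 \ldots k],\quad \arg\min_{j} d\left(C_i^{w_P}, C_j^{\mathrm{ref}}\right) \neq i \;\Rightarrow\; C_i^{w_P} = C_i^{\mathrm{ref}}$$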



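We calculate the spatial barycenters of the two most populated clusters: 

$$a = \arg\max_{i} \operatorname{card}\left(C_i^{w_P}\right), \qquad b = \arg\max_{i \neq a} \operatorname{card}\left(C_i^{w_P}\right)$$

$$B_1 = \mu \left\{ Q \in w_P \;/\; \arg\min_{j} d\left(f(Q), C_j^{w_P}\right) = a \right\}$$

$$B_2 = \mu \left\{ Q \in w_P \;/\; \arg\min_{j} d\left(f(Q), C_j^{w_P}\right) = b \right\}$$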

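Finally, we classify the point P according to whether the part of the image in $w_P$ is 
dithered or not: 

$$\frac{\operatorname{card} C_a^{w_P}}{\operatorname{card} C_b^{w_P}} \approx 1 \text{ and } d(B_1, B_2) < \varepsilon \;\text{(dithered)} \;\Rightarrow\; l(P) = \arg\min_{i \in [1 \ldots k]} \left\{ d\left(f(P), C_i^{w_P}\right),\; d\left(f(\mu_\sigma(w_P)), C_i^{w_P}\right) \right\}$$

$$\frac{\operatorname{card} C_a^{w_P}}{\operatorname{card} C_b^{w_P}} \not\approx 1 \text{ or } d(B_1, B_2) > \varepsilon \;\text{(not dithered)} \;\Rightarrow\; l(P) = \arg\min_{i \in [1 \ldots k]} d\left(f(P), C_i^{w_P}\right)$$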



3 Results 

3.1 Document Image Segmentation 

We validated our algorithm on ancient manuscript images from the Mazarine library 
that were provided by the IRHT⁴. Those images contain black and red text and 
illuminated marks. The verso text is transparently visible on the recto and the characters from 
the recto are partially faded (figure 4a). 

⁴ the French Institute of Research on Texts’ History 
Fig. 4. a) Original ancient manuscript image from Mazarine library, b) Application of Sauvola’s 
algorithm, c) Red text extracted by a serialized k-means, d) Black text extracted by a serialized 
k-means, e) Restored image. 
A traditional binarization cannot lead to an accurate segmentation of the image. 
Figure 4b shows an application of Sauvola’s algorithm, for example. We use the 
initialization described in figure 2. The segmentation into 3 classes follows in figures 4c 
and 4d (the background is not represented here). The parameters are a 6 × 6 window 
size, λ = 0.5, σ = 0.5 and ρ = 50000, and the features are the RGB and HSL channels. 
Once filtered, those images can be used as stencils to create an artificial image of the 
document and partially restore the image. The background is replaced by its 
mean and the red and black layers can be copied as is (figure 4e). 



3.2 Binarization 

We have also studied the capability of our algorithm to directly binarize color images, 
compared to other adaptative binarization algorithms, such as Niblack’s and Sauvola’s, 
applied to luminance images. Figure 5b shows Sauvola’s algorithm applied to the 
luminance channel from figure 5a in an 18 × 18 window. It creates residues at the borders 
of the stain. Our algorithm, applied in a window of only 6 × 6 with λ = 0, σ = 0.5 and 
ρ = 50000 on the RGB and HSL channels (figure 5c) with two classes, takes advantage 
of the direct analysis of the color features and shows its adaptativity. 



3.3 Video Image Segmentation 

Our algorithm is generic enough to be applied to a large variety of images, such as noisy 
images from compressed video. Its adaptativity can overcome the MPEG compression 
noise (figure 6). The parameters are λ = 0.5, σ = 1 and ρ = 50000 in a 6 × 6 window 
and the features are the RGB and HSL channels. 



3.4 Color Segmentation of Dithered Images 

Figure 7 shows an example of a dithered map that is segmented and therefore indexed. 
This image comes from the MediaTeam DataBase (image “P03048.jpg”). The features 
are the RGB and HSL channels. The parameters are a 6 × 6 window size, λ = 0, σ = 0.8 
and ρ = 0. We used the color caption of the map to initialize the clusters. 



3.5 Performance 

By reusing the centers of the previous classification along the scanline, we reduce the 
number of k-means iterations down to 17% of the number required by a basic windowed 
k-means. The average number of iterations for each window displacement lies 
between 2.2 and 3. 

Nevertheless the proposed algorithm, like most adaptative segmentation 
methods, is computationally expensive, depending on the size of the window and the 
dimension of the feature space. The processing of a 3000 × 2000 image with a 6 × 6 
window and 6-dimensional feature vectors, λ = 0, σ = 0.5 and ρ = 50000, takes 
500 seconds on a 1.5 GHz PC. 
Fig. 5. a) Original ancient manuscript image from Vendôme library, b) Binarization achieved by 
Sauvola’s thresholding, c) Binarization achieved by a serialized 2-means algorithm. 



4 Conclusion 

We presented an adaptative segmentation algorithm suited for color document image 
analysis, based on the serialization of the k-means algorithm applied sequentially along 
each scanline. We also proposed to represent each logical class with several clusters in 
the feature space. A small variation of the initialization of the centers of the clusters 
is not critical because of the adaptativity of the algorithm. The number of clusters per 
class and the size of the window have a limited influence on the segmentation results. A 
chart will be produced to help the user in the choice of the λ, σ and ρ parameters, which 
can have a heavy impact on the segmentation accuracy and on the processing time. 
Fig. 6. Application of the serialized k-means to video images. 




Fig. 7. Indexing of a dithered map image. The caption was used to initialize the process. 



The results are quite good on all the ancient manuscript images we have processed 
with optimal parameters (number and position of clusters, window size, choice of 
features). We also tested our algorithm on many kinds of images such as video frames, 
natural images and maps. 

We plan to simplify the usage of this algorithm and assist the user to define automat- 
ically some parameters like the number of clusters for each class and the color samples 
for each class. 
Abstract. This paper presents the result of an adaptive region growing 
segmentation technique for color document images using an irregular pyramid 
structure. The emphasis is on the segmentation of textual components for 
subsequent extraction in document analysis. The segmentation is done in the RGB 
color space. A simple color distance measurement and a category of color 
thresholds are derived. The proposed method utilizes a hybrid approach where 
color feature based clustering followed by detailed region based segmentation is 
performed. Clustering is done by merging image color points surrounding a 
color seed selected dynamically. The clustered regions are then put through a 
detailed segmentation process where an irregular pyramid structure is utilized. 
Dynamic and repeating selection of the most suitable seed region, fitting chang- 
ing local condition during the segmentation, is implemented. The growing of 
regions is done through the use of multiple seeds growing concurrently. The al- 
gorithm is evaluated according to 2 factors and compared with an existing 
method. The result is encouraging and demonstrates the ability and efficiency 
of our algorithm in achieving the segmentation task. 



1 Introduction 

Compared to binary and gray scale document images, color document images 
contain a much richer set of information. The use of varying colors allows the subject 
textual area to be distinguishable from the background and non-subject regions. On 
the one hand, the color attribute provides an additional avenue for the extraction of 
textual components. On the other hand, it also introduces new complexity and 
difficulties. First is the variety of color spaces that can be used, where each has its pros and 
cons. No single space is general enough for all uses. Second is the distance 
measurement problem. To date there is not yet a standard and precise way of measuring color 
distance, which is a crucial parameter for all segmentation tasks. Third is the number 
of unique color points. In a frequently used 24-bit true color image, the number can 
reach 16 million. This intensifies the processing complexity. Nevertheless, there 
are various proposed methods attempting to overcome the problem and complexity in 
order to benefit from the advantage of using the color attribute in the segmentation 
process. 
2 Existing Methods 

There are many proposed techniques for color segmentation. As categorized by [1], 
color segmentation can be divided into feature-space based, image-domain based and 
physics based techniques. Feature-based methods focus their attention only on the 
color features where color similarity is the key and only criterion to segment image 
content. Spatial relationship among color is ignored. This has resulted in a problem 
where the segmented regions are usually fragmented. Extra and elaborate post- 
processing is required to retain the compactness of the regions. Image-domain based 
methods belong to a category of methods that take spatial factors into consideration. 
The technique utilizes both color and spatial factors in its homogeneity evaluation. 
Splitting/merging and region growing are two main techniques used in this category. 
The common processing steps are the selection of a seed region, the growing or split- 
ting of regions from this seed point, the merging of homogenous regions and a stop- 
ping criterion for growing or splitting. By nature this is a sequential process where 
each pixel and all its neighbors have to be evaluated. The processing order becomes 
critical at points with the same homogeneity value. The selection of a suitable seed 
region is another problem where the initial selected seed region may dominate the 
growing or splitting process. Although many proposed methods attempt to solve this 
problem by making the best selection, the suitability of a region being a seed point 
does change during the segmentation process. These problems are reported in [1] and 
[2]. Physics based techniques are mainly used to process real scene images where the 
physical models of the reflection properties of materials are utilized. 

Despite the large number of proposed color segmentation algorithms, only a hand- 
ful of them have directly addressed the document image processing domain with the 
focus in text segmentation and extraction. The key requirement in this domain is not 
so much in attempting to find the best approximation in terms of the color features. 
Rather, the emphasis is on how well the segmentation process can retain the major 
document components (i.e. text or non-text) while achieving compactness within each 
component. The challenge is to have just a sufficient number of unique colors for the 
former while minimizing the color uniqueness to attain the latter. The method proposed 
in [3] processes colors in the HSI color space. 
Color clustering is applied only to the Hue component based on the concept of local 
maximum where a chain of pointers is constructed pointing to the neighboring Hue 
values with a larger pixel count. Regions are thus segmented by grouping pixels with 
Hue values pointing to the same local maximum. As seen in this method, a very
intricate post-processing stage of connected component analysis is required to merge
fragments of textual regions. In [4], the same clustering concept is used in the RGB
color space. The method divides the RGB Cartesian space into a number of fixed-size
cubes, where each cube holds the number of pixels whose color falls within the cube.
A pointer chain is then constructed by analyzing the 26 potential neighboring cubes
to locate the local maximum with the highest number of occurring pixels. It differs
from [3] in that clustering is performed in a three-dimensional space, with a 4th
dimension added to take the spatial factor into consideration. The additional
dimension is defined by dividing the image plane into horizontal strips, where each
strip contains a fixed number of image rows. Each bin in the 4-dimensional space
then contains the pixel count of a specified color range located along a certain
strip. Although this overcomes some of the problems caused by the lack of a spatial
factor, its effectiveness is limited by the size of the cubes and the widths of the
strips. The accuracy of the segmentation result also depends on where the color
space and the image plane are divided. Despite this problem, it is an efficient
method that capitalizes on the efficiency and simplicity of a histogram while
incorporating the spatial factor into the clustering process. Another conceptually
similar system, proposed in [5], also attempts to incorporate spatial information
into a feature-based type of color clustering.
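The pointer-chain idea used in [3] and [4] can be illustrated with a minimal
one-dimensional sketch. It is not the implementation of either paper: the function
name `hue_peak_labels`, the 360-bin circular Hue axis and the rule of following the
larger of the two neighboring bins are assumptions made for illustration only.

```python
from collections import Counter

def hue_peak_labels(hues, n_bins=360):
    """Label each hue with the local-maximum hue reached by its pointer chain."""
    hist = Counter(hues)  # pixel count per hue value; missing hues count as 0

    def next_bin(h):
        # Point to the circular neighbour with a strictly larger pixel count,
        # or stay in place when h is already a local maximum.
        best = h
        for nb in ((h - 1) % n_bins, (h + 1) % n_bins):
            if hist[nb] > hist[best]:
                best = nb
        return best

    peak = {}
    for h in hist:
        cur = h
        while True:
            nxt = next_bin(cur)
            if nxt == cur:      # reached a local maximum
                break
            cur = nxt           # counts strictly increase, so this terminates
        peak[h] = cur
    return [peak[h] for h in hues]
```

Pixels whose hues chain to the same peak end up in the same cluster, which is why two
nearby peaks of similar color still produce two separate layers.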

In view of the methods proposed so far, our contributions are in 4 areas. The first is
in the area of color measurement, where a simple measurement method in the RGB color
space is derived, as described in section 3. The second is in the area of color
quantization, where an efficient method that does not need a color histogram is
proposed in section 4.1. The third is our region growing method, where seeds are
selected dynamically and repeatedly to suit the best local condition, which avoids
the problem of a fixed seed dominating the entire growing process. The problem of
sequential processing encountered by other region growing methods is also addressed
by letting multiple seeds grow concurrently. The fourth is our use of the irregular
pyramid structure, which differs from the traditional pyramid in that the pyramid is
constructed from an intermediate level instead of from the original base level in
pixel format. This greatly enhances the processing speed. The final contribution is
described in section 4.2.



3 Color Space and Distance Measurement 

In color segmentation the RGB color space is the most commonly used, where each color
is represented by a triplet of red, green and blue intensities. HSI is another common
color space, where a color is characterized by its Hue, Saturation and Intensity.
Another category of color spaces is based on the CIE color model. The main aim of this
model is to provide uniform color spacing that facilitates direct measurement of color
distance. L*a*b* is one such color space. When selecting a color space for image
segmentation, the key consideration is the ability to measure color distance
accurately and efficiently. Color distance is used as a measure of color similarity,
where pixels/regions satisfying a certain degree of color homogeneity are grouped to
form a cluster. In this respect the CIE L*a*b* color space seems to be the most
promising, since the color distance can be computed directly as the Euclidean distance
between L*a*b* coordinates (i.e. delta-E). In spite of this, not many proposed methods
make use of this color space. This may be due to the complexity of the conversion from
the RGB color space and to some controversy over its accuracy. In the HSI color space,
color distance is frequently measured along each axis separately. Although the Hue
component alone can be used to measure color similarity as in [6], it is not
sufficient for detailed segmentation. Both Saturation and Intensity values must also
be used for finer segmentation results, as in [3]. In addition to this requirement to
analyze the three axes separately, a further complication arises when the Saturation
value is low, where all colors look almost the same despite varying Hue values. This
is reported in both [1] and [7]. In view of these problems we have decided to use the
RGB color space. It is efficient because no conversion is required. It also suffers
from the non-uniformity problem, where the same distance between two color points may
be perceptually quite different in different parts of the space; within a certain
color threshold, however, color consistency can still be defined.

In order to analyze color distance measurement in the RGB color space for the
definition of color similarity, we have conducted an experiment. The experiment starts
with a pivot color. It then randomly generates 620 non-repeating color points, all at
the same distance from the pivot color as computed by a given distance function (i.e.
Euclidean or Manhattan). Each color point is displayed as a 20x20 pixel square, which
is about the size of a 12-point character. All colors are then visually inspected by
10 human subjects to determine their similarity. Each observer votes for one of the 7
categories shown in table 1. This process is repeated for color distances in the range
of 10 to 500 in steps of 10, computed with the different distance measurements. The
final result is obtained by taking the majority vote. The result of the experiment
reveals that the Manhattan distance is the better distance measurement, in that the
generated color points exhibit a more stable visual color similarity. In contrast, the
Euclidean distance produces a wider variation of color perception at the same color
distance. This finding suggests that perceived color is governed by the additive
variation of the red, green and blue intensities rather than by the physical Euclidean
distance between the color points. The categories of threshold limits obtained through
the experiment are shown in table 1. They fall into 4 main groups. The first group
covers distances below 71, where the same color is observed with a very low intensity
variance. The second group ranges from 71 to 120, where the colors appear to be from
the same color series (e.g. dark/light brown) with varying degrees of intensity. The
third group ranges from 121 to 190, where different colors are observed with varying
color ranges. Colors above 190 become quite random and are thus considered undefined
and cannot be interpreted.
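The stimulus generation step of such an experiment could be sketched as follows. This
is only an illustration under stated assumptions: the function name
`colors_at_manhattan_distance`, the rejection-sampling strategy and the fixed random
seed are hypothetical and are not taken from the experiment itself.

```python
import random

def colors_at_manhattan_distance(pivot, d, count, seed=0, max_tries=100000):
    """Return up to `count` distinct RGB colors at Manhattan distance d from pivot."""
    rng = random.Random(seed)
    found = set()                      # a set keeps the colors non-repeating
    for _ in range(max_tries):
        if len(found) == count:
            break
        # Split the total distance d at random over the three channels.
        dr = rng.randint(0, d)
        dg = rng.randint(0, d - dr)
        db = d - dr - dg
        cand = tuple(p + rng.choice((-1, 1)) * delta
                     for p, delta in zip(pivot, (dr, dg, db)))
        # Reject candidates that leave the 8-bit RGB cube.
        if all(0 <= c <= 255 for c in cand):
            found.add(cand)
    return sorted(found)
```

Every returned color differs from the pivot by exactly `d` when the absolute channel
differences are summed, so all patches shown to an observer are equidistant under the
Manhattan measure.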



Table 1. Categories of color threshold limits.

Threshold     Visual inspection result
----------    ------------------------------------------
10 to 30      Same color.
31 to 70      Same color, low intensity variance.
71 to 90      Same color series.
91 to 120     Same color series, low intensity variance.
121 to 150    Different color, small color range.
151 to 190    Different color, wider color range.
Above 190     Very random occurring color.



Based on this experimental result, the following color distance function is derived.
The function computes the total absolute variation of the respective RGB values
between 2 color vectors (i.e. $C_i$ and $C_j$). The further additive factor $\alpha$
serves to discriminate between a color variance that is well distributed among all RGB
values and one with an uneven distribution. The former reflects better color
consistency than the latter. If the distance is within the threshold $T_1$ then the 2
colors are considered “close”. Otherwise they are treated as 2 unique colors. Although
the use of a single threshold to determine the “closeness” between two colors may not
be the most precise way of color measurement in the RGB color space, in our context of
text segmentation it is more than sufficient. In [8], the authors also make use of a
human perception evaluation of color differences to guide the color clustering process.

$$\mathrm{dist}(C_i, C_j) = r' + g' + b' + \alpha$$

$$r' = |C_i^r - C_j^r|, \quad g' = |C_i^g - C_j^g|, \quad b' = |C_i^b - C_j^b|$$

$$\alpha = \frac{|r' - g'| + |r' - b'| + |g' - b'|}{3}$$

$$\mathrm{close}(C_i, C_j, T_1) = \begin{cases} \text{true}, & \text{if } \mathrm{dist}(C_i, C_j) < T_1 \\ \text{false}, & \text{otherwise} \end{cases}$$
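A direct transcription of the distance and closeness functions into Python may help to
see how the additive factor behaves. Colors are assumed to be (R, G, B) tuples of
integers; the function and parameter names are chosen for illustration.

```python
def dist(ci, cj):
    """Manhattan RGB distance plus the additive unevenness factor alpha."""
    r, g, b = (abs(a - c) for a, c in zip(ci, cj))
    # alpha is 0 when the three channel differences are equal and grows
    # when the variation is concentrated in one or two channels.
    alpha = (abs(r - g) + abs(r - b) + abs(g - b)) / 3
    return r + g + b + alpha

def close(ci, cj, threshold):
    """Two colors are 'close' if their distance is below the threshold."""
    return dist(ci, cj) < threshold
```

For example, a uniform shift of 10 in every channel gives a distance of 30 with no
penalty, whereas a shift of 30 in the red channel alone gives 30 + 20 = 50, so the
uneven change is treated as the less consistent of the two.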



4 Proposed Method 

Our proposed method combines color-feature-based and region-based color segmentation.
The algorithm uses a color-feature-based technique in a pre-processing stage to
perform fast segmentation of regions with very close colors. Based on the result, a
region growing method is then employed to perform detailed segmentation of the
remaining regions, taking both color and spatial factors into consideration.



4.1 Pre-processing Stage 

In a 24-bit true color input image, the number of unique colors frequently exceeds
half of the image size. Most of these colors are perceptually close and cannot be
differentiated by human beings. In our study of the RGB color space, a color variance
of 30 and below falls into this category. The pre-processing stage therefore attempts
to aggregate the colors within a boundary of $T_1 = 15$ surrounding a pivot color into
one cluster. This ensures that the maximum color distance among all colors inside the
cluster is within 30. Because the number of color points is usually large, we employ a
simple yet efficient way of clustering. The process loops through the entire image. As
it moves, pivot color points are identified. A color point whose variance exceeds 2
times the $T_1$ limit is inserted as a new pivot color. Those within the color limit
are clustered with the closest pivot color. The efficiency and accuracy of this process
lie between those of a fixed partitioning of the color space, as in the bit-dropping
technique, and those of selecting a color seed by giving preference to a bigger region
[8]. Our proposed method is more efficient than the latter because it does not need
to build a histogram for the selection of a suitable color seed. It is also more
accurate than the bit-dropping technique because it builds the pivot color list
dynamically as it loops through the image. Whereas bit-dropping uses a fixed partition
regardless of the actual color distribution, this process avoids non-existent color
points. The final output of this stage is a group of pixel clusters having color
similarity within a limit of 30.
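The single pass over the image can be sketched as below. This is one possible reading
of the rule described above, not the authors' code: a color becomes a new pivot when
its distance to the closest existing pivot exceeds twice the limit, and otherwise it
joins that pivot. The names `cluster_by_pivots` and `t1` are illustrative.

```python
def dist(ci, cj):
    """Manhattan RGB distance plus the additive unevenness factor (section 3)."""
    r, g, b = (abs(a - c) for a, c in zip(ci, cj))
    return r + g + b + (abs(r - g) + abs(r - b) + abs(g - b)) / 3

def cluster_by_pivots(pixels, t1=15):
    """Assign each pixel color to a pivot color; return (pivots, labels)."""
    pivots, labels = [], []
    for color in pixels:
        best, best_d = None, None
        for idx, pivot in enumerate(pivots):
            d = dist(color, pivot)
            if best_d is None or d < best_d:
                best, best_d = idx, d
        # Assumed rule: open a new pivot only when even the closest pivot
        # is farther away than 2 * t1; otherwise join the closest pivot.
        if best is None or best_d > 2 * t1:
            pivots.append(color)
            best = len(pivots) - 1
        labels.append(best)
    return pivots, labels
```

No histogram is built and the pivot list only ever contains colors that actually occur
in the image, which is the property contrasted with bit-dropping in the text.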



4.2 Detailed Segmentation Stage 

This stage performs a detailed analysis of the clustered regions resulting from the
pre-processing stage and continues to merge regions having a larger color variance.
Region growing is used as the means of clustering, based on an irregular pyramid
structure [9]. A pyramid is a data structure holding image data points in successive
levels of reduced resolution. The lowest level is the original input image at full
resolution. Each successively higher pyramid level holds a smaller representative data
set $L_{i+1}$ of the lower level $L_i$. As a result $L_{i+1}$ is a proper subset of
$L_i$, where the number of data points on level i+1 (i.e. $N_{i+1}$) is less than on
level i (i.e. $N_i$).

$$L_i = \{R_{i,j}\} \text{ where } j = 1 \text{ to } N_i$$

$$L_{i+1} = \{R_{i+1,k}\} \text{ where } k = 1 \text{ to } N_{i+1}, \quad N_{i+1} < N_i$$



The creation of each pyramid level follows 4 main stages. The 1st stage is the
formation of a pyramid level from the selected surviving regions of the lower level.
The 2nd stage is the analysis of the neighboring relationships among regions, where two
regions are neighbors if their children on the lower level are neighbors. The 3rd stage
is the selection of new survivors, which form the candidates for the next higher
pyramid level. The survivor selection process is based on the determination of local
maxima. A region survives if it has the highest surviving value among its immediate
neighbors. The final stage is the selection of children, where each survivor claims a
set of neighboring non-survivors as its children. The process then repeats for the next
higher pyramid level. The detailed pyramid formation process is given in our previous
paper [10]. In the current context we can view the pyramid building process as a way to
perform image segmentation by growing seed regions. The selection of a survivor is
equivalent to the selection of a seed region. The claiming of non-survivors by the
survivor as its children is comparable to the growing of the seed. In contrast to a
regular pyramid, the irregular pyramid structure can achieve this effect because of
its flexibility in survivor selection and its ability to perform selective claiming of
suitable neighboring non-survivors. The intended segmentation result can thus be
obtained through a suitable definition of the seed selection criteria and an
appropriate designation of the growing or claiming rules. In our algorithm the
selection and the growing of seed regions follow three basic eligibility criteria. A
region is eligible to participate in the selection and growth processes if it satisfies
all three eligibility criteria given below. The selection and the growth of a seed
region only occur among regions that exhibit spatial adjacency, color closeness and
size inferiority. An arbitrary region X is eligible to participate in the process
initiated by a pivot region S if it is adjacent to the region S (i.e. a neighbor). A
neighboring region is only considered if the color variance between itself and the
pivot region S falls within a color threshold $T_2$, where the superscript ‘c’ denotes
color. Finally, only a neighbor smaller in size is evaluated, where the superscript ‘a’
represents size.



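$$\mathrm{eligible}(S, X) = \begin{cases} \text{true}, & \text{if } \mathrm{adjacent}(S, X) \wedge \mathrm{close}(S^c, X^c, T_2(S)) \wedge S^a > X^a \\ \text{false}, & \text{otherwise} \end{cases}$$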
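The three eligibility criteria translate into a short predicate. The sketch below
assumes a hypothetical `Region` record holding a mean color, a pixel count and the ids
of adjacent regions; none of these names come from the paper, and the threshold is
passed in as a plain number.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    rid: int                      # region identifier (illustrative)
    color: tuple                  # representative (R, G, B) color
    size: int                     # number of pixels in the region
    neighbors: set = field(default_factory=set)  # ids of adjacent regions

def dist(ci, cj):
    """Manhattan RGB distance plus the additive unevenness factor (section 3)."""
    r, g, b = (abs(a - c) for a, c in zip(ci, cj))
    return r + g + b + (abs(r - g) + abs(r - b) + abs(g - b)) / 3

def eligible(s, x, t2):
    """True if X may take part in the process initiated by pivot region S."""
    adjacent = x.rid in s.neighbors          # spatial adjacency
    close = dist(s.color, x.color) < t2      # color closeness
    smaller = s.size > x.size                # size inferiority of X
    return adjacent and close and smaller
```

Because of the size criterion the relation is asymmetric: a small fragment can be
claimed by a larger neighbor, but never the other way round.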



Unlike the traditional pyramid structure, where the base pyramid level is the original
input image in pixel format, our proposed algorithm begins the pyramid in a region
format (i.e. groups of pixels). After the initial pre-processing stage, regions with
small color variance are formed. Connected component analysis is then used to identify
individual regions from the respective colors. These extracted regions form the base
of the pyramid. Using this intermediate result avoids the processing of the first few
pyramid levels, which are the most time consuming. After the formation of the pyramid
base and the determination of the neighborhood relationships, the survivor/seed
selection process begins. The selection process is based on a surviving value. The
value is computed by allowing each region to vote for its closest neighbor satisfying
the 3 eligibility criteria defined above. The process then determines the local maxima
having the largest number of neighborhood votes. A region that has the largest number
of ‘suitable’ neighbors is a good seed candidate that will enable maximum “healthy”
growing of the seed region into its surroundings. With the selected seeds, the growing
of regions begins. In a traditional irregular pyramid construction process [10], this
is the child selection stage, where each survivor claims a set of suitable surrounding
non-survivors as its children. In our context, we treat this stage as the growing of
the selected seed regions into the eligible surrounding neighbors. Instead of having
the seed actively grow into its neighbors, our algorithm allows the neighbors to take
an active role in searching for the most
suitable seed. A region X becomes part of a seed S if S is the nearest to the region X
among all eligible seed regions $\sigma$ of X. Region X evaluates all its surrounding
seeds $\sigma$ which it is eligible to evaluate, and merges with the one having the
lowest color variance. This maximizes the color closeness among the regions that are
merged. After this stage the entire process repeats to construct another new pyramid
level. A newly formed region $R_{i+1,k}$ on pyramid level i+1 encapsulates a seed
region $R_{i,j}$ and a group of non-surviving regions $R_{i,r}$ on the lower level i.
All the non-survivors are guaranteed to be the nearest eligible neighboring regions.

$$\mathrm{nearest}(S, X) \iff \mathrm{dist}(X^c, S^c) = \min \{ \mathrm{dist}(X^c, \sigma^c) \;\; \forall \sigma \mid \mathrm{eligible}(\sigma, X) \}, \quad \sigma \in \text{neighbouring seeds / survivors of } X$$

$$R_{i+1,k} = \{R_{i,j}\} \cup \{R_{i,r}\} \text{ where } r = 1 \text{ to } |R_{i+1,k}|, \quad \mathrm{eligible}(R_{i,j}, R_{i,r}) \wedge \mathrm{nearest}(R_{i,j}, R_{i,r})$$
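The voting, survivor selection and growth steps for one pyramid level can be sketched
as follows. This is a simplified illustration, not the construction of [10]: regions
are plain dictionaries keyed by integer ids, ties between equal vote counts are broken
in favor of the lower id, and a region that no seed can claim is carried to the next
level unchanged. All of these choices, and the name `build_level`, are assumptions.

```python
def dist(ci, cj):
    """Manhattan RGB distance plus the additive unevenness factor (section 3)."""
    r, g, b = (abs(a - c) for a, c in zip(ci, cj))
    return r + g + b + (abs(r - g) + abs(r - b) + abs(g - b)) / 3

def build_level(regions, t2):
    """regions: {id: {"color": (r, g, b), "size": int, "nbrs": set of ids}}
    t2: {id: color threshold of that region}
    Returns {seed id: [ids of regions merged into the seed]}."""
    def cdist(a, b):
        return dist(regions[a]["color"], regions[b]["color"])

    def eligible(s, x):
        return (x in regions[s]["nbrs"] and cdist(s, x) < t2[s]
                and regions[s]["size"] > regions[x]["size"])

    # Voting: each region X votes for its closest eligible neighbour S.
    votes = {rid: 0 for rid in regions}
    for x in regions:
        cands = [s for s in regions[x]["nbrs"] if eligible(s, x)]
        if cands:
            votes[min(cands, key=lambda s: cdist(s, x))] += 1

    # Survivors: local maxima of the vote count (tie-break: lower id wins).
    def beats(a, b):
        return (votes[a], -a) > (votes[b], -b)
    survivors = {r for r in regions
                 if votes[r] > 0 and all(beats(r, n) for n in regions[r]["nbrs"])}

    # Growth: each non-survivor joins its nearest eligible surviving neighbour.
    level = {s: [] for s in survivors}
    for x in regions:
        if x in survivors:
            continue
        seeds = [s for s in regions[x]["nbrs"] if s in survivors and eligible(s, x)]
        if seeds:
            level[min(seeds, key=lambda s: cdist(s, x))].append(x)
        else:
            level[x] = []        # unclaimed region is carried over as it is
    return level
```

In the small example used to check the sketch, a large dark region absorbs a small
adjacent fragment of almost the same color, while a bright background region, which
is neither close in color nor smaller, stays separate.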



The pyramid construction process stops when the reduction rate falls below 0.1. The
reduction rate is defined as the rate of decrease in the number of newly created
regions on the next higher pyramid level. Through experiment we observe that 0.1 is
the break-even point, where any continuation of the growing process beyond this stage
does not yield any noticeable improvement in the segmentation result. The next stage
is the text extraction process, where the connected components on each color layer are
extracted and their textual status is verified. The verification is done through a
simple consistency check of each component's size, width and height to determine its
textual identity.
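The stopping rule can be made concrete with a small sketch. It assumes that the
reduction rate between two consecutive levels is the relative decrease
(N_i - N_{i+1}) / N_i in the number of regions; the text does not spell out the
formula, so this reading and the function names are assumptions.

```python
def reduction_rate(n_lower, n_upper):
    """Relative decrease in region count from one level to the next (assumed)."""
    return (n_lower - n_upper) / n_lower

def last_level_built(counts, min_rate=0.1):
    """Index of the level at which construction stops, given per-level counts."""
    for i in range(1, len(counts)):
        if reduction_rate(counts[i - 1], counts[i]) < min_rate:
            return i
    return len(counts) - 1
```

With region counts of 1000, 400, 250 and 240 the rates are 0.6, 0.375 and 0.04, so
construction would stop once the level with 240 regions has been built.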
4.3 Threshold Derivation



This section describes the derivation of the threshold $T_2$, which is used in the
eligibility function to determine the color closeness of two regions. Two regions are
“close” if their color difference falls below this threshold. It is different from
$T_1$ used in the pre-processing stage (i.e. section 4.1), which is an empirical value.
The threshold $T_2(R_{i,j})$ is determined dynamically and locally by taking the
average of all color variances between the pivot region $R_{i,j}$ and all its
neighbors $N(R_{i,j})$. This enables good adaptation to the varying color contrast
conditions across image regions. The design of the formula is based on the assumption
that regions belonging to the same image object have lower color contrast than those
in different image objects. During the initial growing stage of a region, the majority
of the neighboring regions may belong to the same object. The threshold provides an
unrestrictive and yet steady growth of the regions. As the growing reaches maturity
(i.e. reaches the boundary), the majority of the regions with a homogeneous background
will stop at this threshold value (i.e. the average variance). For the remaining
regions with a complex background, an upper color limit $T_3$ is required to avoid
excessive over-segmentation of regions. In order to have a suitable upper bound that
applies to all situations, $T_3$ is determined globally. As shown below, $T_3$ is
defined as the overall average color variance among all pixels P of the entire image
plus the standard deviation of all the variances. In order to have a good estimate for
$T_3$, flat regions with zero variance and regions whose variance is beyond the 190
threshold stated in table 1 are ignored.

$$T_2(R_{i,j}) = \min\left( \frac{\sum \mathrm{dist}(R_{i,j}^c, N^c(R_{i,j}))}{|N(R_{i,j})|}, \; T_3 \right)$$

$$T_3 = \frac{\sum^{ImgSize} \{ v \mid v = \mathrm{dist}(P^c, R_p^c) \text{ where } 190 > v > 0 \}}{\text{total number of } v} + \mathrm{stdev}(v)$$
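The two thresholds can be sketched in a few lines. The sketch takes the per-pixel
color variances as an already computed list, since how each variance is measured is
not fully recoverable from the formula; the population standard deviation and the
function names are likewise assumptions.

```python
from statistics import mean, pstdev

def dist(ci, cj):
    """Manhattan RGB distance plus the additive unevenness factor (section 3)."""
    r, g, b = (abs(a - c) for a, c in zip(ci, cj))
    return r + g + b + (abs(r - g) + abs(r - b) + abs(g - b)) / 3

def global_upper_bound(variances, cutoff=190):
    """Mean plus standard deviation of the variances v with 0 < v < cutoff."""
    v = [x for x in variances if 0 < x < cutoff]   # drop flat and random values
    return mean(v) + pstdev(v)

def local_threshold(region_color, neighbor_colors, upper_bound):
    """Average distance to all neighbours, capped by the global upper bound."""
    avg = mean([dist(region_color, n) for n in neighbor_colors])
    return min(avg, upper_bound)
```

A region surrounded by similar neighbors therefore receives a small, permissive but
local threshold, while a region in a busy background is held back by the global cap.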



5 Experimental Results 

Evaluation of color segmentation is not an easy task. There exists no general
methodology to evaluate the correctness of a color segmentation result. The validity
of the resulting colors and the segmented regions varies according to human perception
and the original intent. As a result the measurement is qualitative rather than
quantitative. The wide variety of color space representations and segmentation
techniques in use adds to the evaluation complexity, and has also made comparative
study among algorithms difficult. Since the ultimate aim of our proposed algorithm is
color document image segmentation that preserves both the major graphical and textual
components, with the latter as the focus, we evaluate the result based on 2 factors.
The most important factor is how well the algorithm ensures spatial compactness in
the segmented textual regions. The 2nd factor assesses the effectiveness in retaining
the original image content (i.e. non-text) after segmentation. The 1st factor is
evaluated by counting the number of correctly extracted textual components on the
respective color layers that are visually recognizable. The 2nd factor is measured by
visually counting the number of retained major features in the image. For the
comparative study we use the ‘pointer’ method [4] as described above.

Figure 1 shows the test sample of a logo taken from the web. After color segmentation
and text extraction, all recognizable text regions are detected with minor
over-segmentation for the character ‘A’ (i.e. figures 1c, 1f). Figure 1b shows the
segmentation result of the ‘pointer’ method, where only the bigger texts in the centre
(i.e. aitp) are extracted (i.e. figure 1e). All the smaller texts along the circular
path are classified as noise. As shown in figure 1d, the ‘pointer’ method has failed
to group the pixels within a character into a single color cluster. As a result,
pixels belonging to the same character are fragmented into multiple color layers.
Without further connected component analysis, each fragment of the character is too
small to be classified as text. The lack of continuity in the color values for pointer
chasing may be the main cause of the failure.

Figure 2 is a test sample of an advertisement for onions on the web. As shown in
figure 2f, all text regions are extracted correctly by our method. In contrast, the
‘pointer’ method produces the results illustrated in figures 2d and 2e, which show the
components on the 2 color layers holding the main bulk of the textual content. In
order to better demonstrate the degree of fragmentation, these images are used instead
of the image after text extraction, where most of the components are removed as
graphics or noise. As seen in these 2 images, the textual regions are split into 2
main color clusters. Although visually the representative colors of the 2 layers are
close, both are color peaks in their own local regions of the feature space and thus
they are clustered as 2 separate color layers. Figure 3 demonstrates the same effect
for another test sample, where the word ‘On’ and the character ‘n’ in ‘volcanoes’ are
segmented into multiple color layers, as shown in figures 3c, 3d and 3e. Our method
again proves effective, with the correct and full extraction shown in figure 3g.

The weather service logo test sample in figure 4e demonstrates the ability of our
algorithm to preserve major regional contents (i.e. woman and tower), whereas with the
‘pointer’ method both regions are almost absorbed into the background area (i.e.
figure 4c). This may be due to the problem of using a histogram when the color pixel
count of both regions is small. The possibility of such a weak color becoming a local
maximum is very low. As such it is absorbed by a neighboring color that is stronger in
terms of area coverage. The last sample in figure 4b shows an image having colors that
are perceptually very close. This has stretched the ‘pointer’ method to its extreme,
where the result is a single color cluster (i.e. figure 4d). In contrast, our method
(i.e. figure 4f) performs satisfactorily. The results after evaluating 38 images
according to the 2 factors are shown in table 2, which confirms that our algorithm
achieved the intended task. For textual components, our method achieved an 84%
identification rate (i.e. 128/152) as compared to the 70% attained by the ‘pointer’
method. The majority of the results are satisfactory except for images with very high
color variance among the textual fragments, where the aim of compactness cannot be
achieved.

Fig. 1. Test sample: logo (a) original image, (b) pointer segmentation, (c) pyramid
segmentation, (d) zoom-in of image b, (e) pointer text extraction, (f) pyramid text
extraction.

Fig. 2. Test sample: onions advertisement (a) original image, (b) pointer segmentation,
(c) pyramid segmentation, (d), (e) pointer segmentation color layer 1 and 2, (f)
pyramid text extraction.

Fig. 3. Test sample: volcanoes advertisement (a) original image, (b) pointer
segmentation, (c), (d), (e) pointer segmentation color layers, (f) pyramid
segmentation, (g) pyramid text extraction.

Fig. 4. Two test samples: Weather service logo, Wildlife magazine (a), (b) original
image, (c), (d) pointer segmentation, (e), (f) pyramid segmentation.



Table 2. Evaluation result.

Factors     Original    Pyramid      Pointer
--------    --------    ---------    ---------
Text        152         128 (84%)    107 (70%)
Non-text    93          86 (92%)     72 (77%)
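The percentages in table 2 follow directly from the raw counts, as the short check
below shows. The dictionary layout and the helper name `rate` are illustrative only.

```python
def rate(found, total):
    """Identification rate as a whole-number percentage."""
    return round(100 * found / total)

# Counts taken from table 2: (original, pyramid, pointer)
counts = {"Text": (152, 128, 107), "Non-text": (93, 86, 72)}

rates = {name: (rate(pyramid, original), rate(pointer, original))
         for name, (original, pyramid, pointer) in counts.items()}
```

Here `rates` evaluates to 84 and 70 for text and 92 and 77 for non-text, matching the
figures reported for the pyramid and pointer methods respectively.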



6 Conclusions 

We have proposed a novel color segmentation technique to be used for text extraction.
The segmentation is done in the RGB color space. A simple color distance measurement
(i.e. Manhattan) and categories of color thresholds are derived. The proposed
algorithm is divided into two stages. The initial pre-processing stage uses a dynamic
seed selection and clustering process with a very close color range. The clustered
regions are then used as input to the detailed segmentation process (i.e. the 2nd
stage), where both color and spatial factors are considered. The segmentation is based
on a region growing technique using an irregular pyramid. It differs from other
region growing methods in its selection of seeds and in the way regions are grown. The
algorithm is evaluated according to 2 factors and compared with the ‘pointer’ method
[4]. The results demonstrate the ability of our algorithm to achieve segmentation for
text extraction.

Abstract. The goal of the international project Memorial is automatic information
retrieval from machine typed paper documents belonging to several classes. In this
paper the problem of pre-processing and segmentation of scanned archival documents is
considered. The goal of these processes is to determine exactly the text regions in
the document for further OCR processing. Text regions are initially located for each
document class as XML templates. Then, a region-matching algorithm is used to locate
the regions precisely in the current document.



1 Introduction 

A strategic goal of the international consortium undertaking the project Memorial is to 
enable creation of virtual archives based on documents existing in libraries, archives, 
museums, and public record offices [1]. To achieve this goal, the consortium focuses 
on computer-aided information retrieval from machine typed paper documents. Fur- 
ther steps, one including development of data models for the storage of retrieved in- 
formation, and another, involving development of distributed virtual memorial ser- 
vices for navigation and information search, are planned by the consortium upon 
successful completion of the Memorial project. In the first phase of the project the 
archive documents from former Nazi concentration camp museums (e.g. State Mu- 
seum Stutthof in Sztutowo near Gdansk) are considered. 

Creation of digital libraries of archive documents requires solving many different 
problems connected not only with storing and management of original documents but 
also with automatic information retrieval from them. This information stored together 
with images of original documents and possibly some additional information creates 
complex structures called digital documents. Digital documents allow for advanced 
searching possibilities on different security levels for historians, families of victims 
and others. 

Although several advanced OCR (Optical Character Recognition) systems exist and the
structure of the machine typed documents is rather simple, automation of the
recognition process is a difficult task that raises numerous problems of a scientific,
technical, organizational and legal nature [2].

Achieving high quality in the information retrieval (recognition) process depends
heavily on the image processing and recognition methods and tools applied. In the
Memorial project the professional OCR system DOKuStar from Document Technologies GmbH
is used. Numerous experiments carried out with OCR programs revealed that the quality
of recognition of the documents’ content depends strictly on proper image
preprocessing, which covers:

- image acquisition (scanning); 

- document localization (layout) in whole image; 

- optional skew correction and elimination of geometric distortions (mostly in digital 
camera acquired images); 

- background and noise removal;

- localization of text regions in the document. 

This paper presents research and experimental results concerning the acquisition and
pre-processing of archival documents, as well as the localization of regions based on
predefined document schemes.



2 Acquisition of Archival Documents 

The acquisition of archival documents raises numerous problems of a scientific,
technical, organizational and legal nature. In the Memorial project a comprehensive
study was carried out in order to specify correct procedures for high quality
information retrieval from such documents.

2.1 Protection of Personal Data 

Protection of the personal data contained in archive documents is one of the most
important aspects of the information retrieval process. In the Memorial project a set
of false documents was prepared that could be used in some experiments and for the
Internet presentation of the project goals and results. These test documents have been
made using an original typewriter and paper sheets, which results in their great
similarity to the original patterns (Fig. 1a).

However, experiments proved that this similarity is not sufficient at the image
pre-processing stage. The differences between the false and original documents have
several causes, e.g. the difficulty of simulating such effects as mechanical damage to
the paper, the influence of atmospheric factors (humidity, floods, light), chemical
reactions of the ink and paper, copies from other pages and many others. Example
damage to original documents is presented in Fig. 1b*.

2.2 Scanning 

The main goal of scanning documents for archive purposes is their faithful analog to
digital conversion. However, in some cases traditional scanning may lose some
information that is significant from the point of view of the further recognition
process. In order to avoid such a situation in the Memorial project, additional
experiments have been carried out with alternative scanning of documents using
infrared light. Images scanned in that way are characterized by a greater intensity of
some image artifacts (e.g. copies from other pages), but generally their quality was
distinctly worse than that of images scanned with traditional scanners.

* Considering the demands of personal data protection, all such information has been
removed from the images presented in this paper.

278 Mariusz Szwoch and Wioleta Szwoch

Theoretically, there exists a possibility of using information from both kinds of
scanned images to improve the recognition results, but it requires additional storage
and processing time and introduces the additional problem of image matching.
Considering this, further research in the Memorial project has been done with
different types of traditional scanners.




Fig. 1. a) false document from a former Nazi concentration camp, b) fragments of
original documents with example artifacts.



The optimal scanning resolution for image recognition is usually set in the range of
75-400 DPI [3]. The lower boundary of this range results from the minimal width of the
line-shaped objects occurring in the documents. These lines are expected to be at
least 1 or 2 pixels wide after scanning [4]. Although the resolution calculated in
this way for the archive documents (from the concentration camp) lies between 50 and
75 DPI, in the Memorial project the higher resolution of 300 DPI has been set. This
resolution was chosen because the scanned images are used not only for recognition but
also, concurrently, for archiving purposes. These images are presented to authorized
users of the digital library instead of the original (paper) ones.

Unfortunately, using high-resolution images significantly increases the system's demand
for memory and the time needed for document processing. For example, a single
A4 image stored as a TrueColor uncompressed bitmap (BMP) has a size of 26 MB,
and a single filter operation (3x3 window) lasts^ about 1 s. Although all the algorithms
described below were run on images of 300 DPI resolution, it would obviously be
possible to reformulate them for documents of lower resolution, which could
increase their speed but decrease the accuracy of the subsequent localization.
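
The 26 MB figure can be checked with a short calculation. The sketch below assumes the standard A4 sheet size of 210 x 297 mm and 3 bytes per TrueColor pixel, and it ignores the small BMP file header; the function name is illustrative only.

```python
MM_PER_INCH = 25.4


def uncompressed_size(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, pixel_data_bytes) of an uncompressed scan."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# A4 sheet scanned at 300 DPI in 24-bit colour
a4_width, a4_height, a4_bytes = uncompressed_size(210, 297, 300)
print(a4_width, a4_height, a4_bytes / 1e6)  # 2480 x 3508 pixels, about 26.1 MB
```

Halving the resolution to 150 DPI would cut the pixel data to roughly a quarter, which is the speed versus accuracy trade-off mentioned above.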




Fig. 2. The life cycle of digital documents. 



3 Preprocessing of Archival Documents 

The automatic document recognition process consists of several characteristic
stages: image preprocessing, segmentation, recognition and interpretation [2]. These
stages are parts of the digital document's life cycle, which is presented in Fig. 2.

The efficiency and quality of the automatic recognition process is a primary condition
for the final success of the whole digital library project. When only low recognition
efficiency is achieved, the only alternative is manual information retrieval from the
original documents.

^ All described experiments have been carried out on a PC with an Athlon 2.5 GHz
processor and 512 MB of RAM.

As mentioned earlier, the efficiency of OCR systems depends to a large extent on the
proper execution of the initial stages, especially background elimination and
segmentation. Experiments carried out with the DOKuStar system revealed its sensitivity
to the correct localization of the regions containing the text to be recognized. This
creates the additional problem of exact text region localization.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE page SYSTEM "DTD\page.dtd" []>
<page name="singlepagetransportlist" size_width="2552" size_height="3508">
  <content origin_x="0" origin_y="0">
    <region foreground_color="#000000" background_color="#ffffff"
            anchor_top="185" anchor_left="260" stretch_width="1185"
            stretch_height="265" stretch_x_tolerance="0"
            stretch_y_tolerance="0" anchor_x_tolerance="0"
            anchor_y_tolerance="0" skew_tolerance="0" skew="0"
            optional="False">
      <text font_name="Courier New" charset="ISO-8859-1">
        <composed_text type="">
          <line skip="265" length="0" align="unknown" valign="unknown"></line>
        </composed_text>
      </text>
    </region>

Fig. 3. Listing of the XML schema fragment describing template regions’ location. 

Most recognition algorithms use some expert knowledge, which can be found in the
assumed preconditions, models, threshold parameters and so on. In the Memorial
project the expert knowledge is put into the schema mechanism, which describes the
general structure of documents of a certain type together with the standard location of
the text regions in them. The schemas are written in XML and form the foundation
of the digital document description. The contents of the fields of each document schema
are filled in by the OCR algorithms and also by automatic and manual corrections. An
example fragment of a document schema is presented in Fig. 3, and the locations of the
template regions are presented in Fig. 4a (white rectangles).
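
As an illustration of how such a schema could be consumed, the sketch below reads the template regions with Python's standard xml.etree.ElementTree module. The attribute names are taken from the listing in Fig. 3, but their interpretation (anchor_left and anchor_top as the upper left corner, stretch_width and stretch_height as the region size, all in pixels) is an assumption, and the shortened, closed fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened and closed version of the Fig. 3 fragment (hypothetical).
SCHEMA_FRAGMENT = """
<page name="singlepagetransportlist" size_width="2552" size_height="3508">
  <content origin_x="0" origin_y="0">
    <region anchor_top="185" anchor_left="260"
            stretch_width="1185" stretch_height="265" optional="False"/>
  </content>
</page>
"""


def template_regions(xml_text):
    """Return (left, top, width, height) for every <region> element."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for region in root.iter("region"):
        boxes.append((
            int(region.get("anchor_left")),
            int(region.get("anchor_top")),
            int(region.get("stretch_width")),
            int(region.get("stretch_height")),
        ))
    return boxes


regions = template_regions(SCHEMA_FRAGMENT)
print(regions)
```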



3.1 Document Localization 

Some of the archive documents processed in the Memorial project were scanned against
a background of gray-green broadcloth (Fig. 4a). The first stage of processing such
documents is to locate them inside the image. Given the constant background
characteristic (color) and unchanging light conditions (scanner), the RGB color
characteristic of the background has been estimated. A simple algorithm samples
the image from its edges towards the center, at every M-th row and column (M=50).
Encountering N subsequent points (N=3..10) whose color characteristic differs from
the background indicates the edge of the document. Although lower values of M allow
a very precise determination of the document edge (including cut corners, wavy
edges etc.), this is not important from the point of view of further processing. The
document layout of the example image (Fig. 4a) is marked as the largest light-gray
rectangle. The presented algorithm is fully effective and very fast.
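
The edge search described above might be sketched as follows. This is a minimal illustration, not the project's implementation: the image is assumed to be a list of rows of RGB tuples, and the per-channel tolerance tol used to decide that a pixel differs from the background is a hypothetical parameter that the text does not specify.

```python
def differs(pixel, background, tol):
    """True if any RGB channel is further than tol from the background colour."""
    return any(abs(p - b) > tol for p, b in zip(pixel, background))


def first_run(pixels, background, tol, n):
    """Index where the first run of n consecutive non-background pixels starts."""
    run = 0
    for i, pixel in enumerate(pixels):
        run = run + 1 if differs(pixel, background, tol) else 0
        if run == n:
            return i - n + 1
    return None


def locate_document(image, background, m=50, n=3, tol=30):
    """Bounding box (left, top, right, bottom) of the sheet, or None."""
    height, width = len(image), len(image[0])
    lefts, rights, tops, bottoms = [], [], [], []
    for y in range(0, height, m):        # every m-th row, scanned from both sides
        row = image[y]
        start = first_run(row, background, tol, n)
        if start is not None:
            lefts.append(start)
            rights.append(width - 1 - first_run(row[::-1], background, tol, n))
    for x in range(0, width, m):         # every m-th column, scanned from both ends
        column = [image[y][x] for y in range(height)]
        start = first_run(column, background, tol, n)
        if start is not None:
            tops.append(start)
            bottoms.append(height - 1 - first_run(column[::-1], background, tol, n))
    if not lefts or not tops:
        return None
    return min(lefts), min(tops), max(rights), max(bottoms)


# Synthetic example: a light sheet on gray-green cloth (colours are made up).
CLOTH, PAPER = (100, 120, 100), (230, 230, 220)
demo = [[PAPER if 10 <= y < 30 and 15 <= x < 45 else CLOTH for x in range(60)]
        for y in range(40)]
box = locate_document(demo, CLOTH, m=5, n=3)
print(box)
```

Because only every M-th row and column is inspected, the cost grows with the image perimeter rather than its area, which is consistent with the speed reported above.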




Fig. 4. Original Nazi camp document with a) marked regions: template (white), moved (dark
grey) and suited (light grey), b) character segments and regions.



3.2 Background Separation 

The next step in the pre-processing of a typescript image is background separation,
i.e. removing those parts of the image that do not contain any information that
could be essential in the further recognition process. Background separation isolates
the machine-typed characters, whereas graphical elements (such as signatures
and handwritten notes) are treated as unwanted disturbances and should be eliminated.
Unfortunately, the low quality of the archive documents does not allow global, or
even local, thresholding algorithms to be used [5]. Moreover, interactive, on-line
document processing rules out any advanced thresholding method with high
computational complexity.

High-Pass Filters 

High-pass filters are commonly used for edge detection and, in particular, enable the
localization of character segments. In the very first stages of research in the Memorial
project, modified Sobel filters [3] (3x3 mask) were used with very good results
(Fig. 5a). Unfortunately, this method failed when applied to original documents, which
was probably caused by their significant blurring (Fig. 5b). Increasing the mask size
(5x5) improved the results of segment detection (Fig. 5c), but it also increased the
filtering time. This prevents high-pass filters from being applied to the whole image,
but they may still be used locally to improve thresholding results in doubtful cases.




Fig. 5. Background separation using: Sobel filters (a - false document, b - original document
(mask 3x3), c - original document (mask 5x5)) and d - algorithm for colour images.

Thresholding of Colour Images 

When applying OCR systems to images of good quality it is enough to consider
monochromatic or even black-and-white images. However, in archival documents of bad
quality very important information also exists in the chromatic part of the image.
Discarding this information may cause wrong background separation and, further, may
lead to poor information retrieval.

Further research on this problem resulted in a new algorithm that is based on the full
colour information (RGB) contained in the document image and converts it to a
monochromatic (or black-and-white) image. This algorithm eliminates most of the
background pixels in the image (Fig. 5d) on the basis of 4 criteria that use several
thresholds t, calculated from the statistical properties of the document images.
Importantly, the choice of these thresholds is not crucial because they have a safe
error margin across different documents. The algorithm uses the following criteria to
eliminate background pixels:

• rejecting all pixels belonging to the grey-green background of the documents
(Section 3.1). Such points may also occur inside the determined document layout, mostly
in the case of physical damage to the document, e.g. punch-holes, cut corners, torn
edges, very light paper sheets etc.

• thresholding the intensity J using 2 safe thresholds, t_Jmin (=50) and t_Jmax (=180),
which determine points that certainly do not belong to characters: points that are too
dark (J < t_Jmin) occur very rarely because of ink fading, and points that are too
bright (J > t_Jmax) belong to the characters' background.

• filtering out all points that have at least one neighbouring pixel to their right with
the colour of the document background (green-grey). The goal of this operation is to
detect large regions of document background. Information about these regions may be
further used for estimating the scale of document damage [2] or for locating
punch-holes. Using a one-dimensional filter significantly increased the speed of the
operation, though better results could undoubtedly be obtained using a 3x3 mask.

• rejecting 'colour' points, i.e. all points that have a dominant red, green or blue
colour component. This criterion results from the assumption that black ink was used
in typed documents and that fading was nearly equal for all three of its RGB colour
components. Under this criterion all points are rejected that have any colour
component greater than any other by more than a threshold (=30). Taking such a high
value, common to all colour components, gives very good results and provides a high
error margin. Additional experiments have shown that decreasing this threshold
individually for each colour component could lead to even better background separation
results, except for some specifically coloured regions.
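
The four criteria can be sketched as a per-pixel test. The thresholds 50, 180 and 30 come from the text; everything else is an assumption made for illustration: the intensity J is taken as the mean of the three RGB components, the cloth colour is matched with a hypothetical per-channel tolerance cloth_tol, and the example colours are invented.

```python
def near(pixel, reference, tol):
    """True if every RGB channel lies within tol of the reference colour."""
    return all(abs(p - r) <= tol for p, r in zip(pixel, reference))


def is_background(pixel, right, cloth, cloth_tol=20,
                  t_jmin=50, t_jmax=180, t_colour=30):
    """Apply the four rejection criteria to one pixel (right may be None)."""
    if near(pixel, cloth, cloth_tol):                  # 1: cloth-coloured pixel
        return True
    intensity = sum(pixel) / 3                         # assumed definition of J
    if intensity < t_jmin or intensity > t_jmax:       # 2: too dark or too bright
        return True
    if right is not None and near(right, cloth, cloth_tol):  # 3: cloth to the right
        return True
    if max(pixel) - min(pixel) > t_colour:             # 4: dominant colour component
        return True
    return False


def foreground_mask(image, cloth):
    """True marks pixels kept as candidate character pixels."""
    mask = []
    for row in image:
        mask.append([
            not is_background(p, row[x + 1] if x + 1 < len(row) else None, cloth)
            for x, p in enumerate(row)
        ])
    return mask


INK, PAPER = (70, 72, 75), (200, 195, 185)
STAMP, CLOTH = (170, 60, 60), (100, 120, 100)
mask = foreground_mask([[INK, INK, PAPER, STAMP, INK, CLOTH]], CLOTH)
print(mask[0])
```

In this toy row only the first two ink pixels survive: the paper is too bright, the stamp has a dominant red component, and the last ink pixel has cloth directly to its right.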



3.3 Detection of Segments 

Determining the location of the characters that belong to text regions is the first step
of document segmentation. Because the main goal of this stage is segment location (not
recognition), it is possible to use methods that emphasise segments but do not always
preserve their shape. The algorithm developed for the Memorial project consists of
two steps:

• noise removal using a logical filter with a mask size of 7x7. This filter removes all
pixels that have fewer than 5 neighbours inside the mask window centred on them.
Although the filter's 'square' radius is rather large, the filtering does not consume
much time because it is performed only for pixels separated from the background.

• morphological dilation of the image with a rectangular (9x3) structuring element,
which connects broken parts of the characters' segments.

Finally, the algorithm returns a list of segments that represent single or connected
machine-typed characters (Fig. 4b).
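
Both steps can be sketched on a binary mask stored as a list of rows of 0/1 values. The sketch assumes that the 9x3 structuring element is 9 pixels wide and 3 pixels high (the text does not state the orientation) and that the pixel itself is not counted among its neighbours.

```python
def remove_noise(mask, size=7, min_neighbours=5):
    """Drop foreground pixels with too few foreground neighbours in the window."""
    height, width, r = len(mask), len(mask[0]), size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue                      # only foreground pixels are examined
            count = sum(mask[yy][xx]
                        for yy in range(max(0, y - r), min(height, y + r + 1))
                        for xx in range(max(0, x - r), min(width, x + r + 1))) - 1
            if count >= min_neighbours:
                out[y][x] = 1
    return out


def dilate(mask, elem_width=9, elem_height=3):
    """Binary dilation with a rectangular structuring element."""
    height, width = len(mask), len(mask[0])
    rx, ry = elem_width // 2, elem_height // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for yy in range(max(0, y - ry), min(height, y + ry + 1)):
                    for xx in range(max(0, x - rx), min(width, x + rx + 1)):
                        out[yy][xx] = 1
    return out


# Two 3x3 blobs separated by a 3-pixel gap, plus one isolated noise pixel.
demo = [[0] * 20 for _ in range(9)]
for y in range(3, 6):
    for x in list(range(2, 5)) + list(range(8, 11)):
        demo[y][x] = 1
demo[0][18] = 1

cleaned = remove_noise(demo)
joined = dilate(cleaned)
print("".join(str(v) for v in joined[4]))
```

The isolated pixel disappears in the first step and the gap between the two blobs is closed in the second, so they would be reported as one segment.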

3.4 Matching of Regions 

The aim of the region matching algorithm is to establish the real location of the
template regions in the processed document. This goal may be achieved in two ways:

1. Finding the real location of the upper left corner of the template region while
preserving its original size. For the example image (Fig. 4) such regions are marked
with dark-grey (blue) rectangles.
2. Finding the smallest rectangle that contains all characters of the region. For the
example image (Fig. 4) such regions are marked with light-grey (yellow) rectangles.

The algorithm used to match a text region finds all segments of proper size that lie
fully inside the template region or intersect it with at least 40% of their size inside
it. The minimal dimensions of a segment depend on the font size and the scanning
resolution, and for the concentration camp documents they have been established as
t_minsegx = t_minsegy = 15 pixels.
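
The second matching variant (the smallest rectangle containing all characters of the region) might be sketched as follows. The 40% and 15-pixel thresholds come from the text; measuring "size" as bounding-box area and representing boxes as (left, top, right, bottom) with exclusive right and bottom edges are assumptions of this sketch.

```python
def inside_fraction(segment, region):
    """Fraction of the segment's bounding-box area that lies inside the region."""
    left, top = max(segment[0], region[0]), max(segment[1], region[1])
    right, bottom = min(segment[2], region[2]), min(segment[3], region[3])
    overlap = max(0, right - left) * max(0, bottom - top)
    area = (segment[2] - segment[0]) * (segment[3] - segment[1])
    return overlap / area if area else 0.0


def match_region(template, segments, min_size=15, min_inside=0.4):
    """Smallest rectangle around all accepted segments, or None if there are none."""
    accepted = [s for s in segments
                if s[2] - s[0] >= min_size and s[3] - s[1] >= min_size
                and inside_fraction(s, template) >= min_inside]
    if not accepted:
        return None
    return (min(s[0] for s in accepted), min(s[1] for s in accepted),
            max(s[2] for s in accepted), max(s[3] for s in accepted))


template = (100, 100, 300, 150)
segments = [
    (110, 105, 130, 135),   # fully inside the template
    (290, 110, 310, 140),   # 50% inside: accepted
    (295, 110, 315, 140),   # 25% inside: rejected
    (150, 110, 160, 120),   # 10x10 pixels: too small
]
matched = match_region(template, segments)
print(matched)
```

Note that the matched rectangle may extend beyond the template (here to x = 310) because accepted segments are included whole.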

The algorithm assumes that each template region covers or intersects all characters
belonging to it. Examples of such regions are marked with 1 and 2 in Fig. 4. If this
requirement is not fulfilled, the algorithm extends the region so that it covers all
characters lying nearby, but it does not have enough information to continue this
process to cover more distant rows or columns. An example of such a situation is marked
with 3 in Fig. 4. Extending regions by subsequent rows and columns requires knowledge
about the whole document structure and semantic analysis of neighbouring segments
(e.g. whether they form a row or a column in a table). Such semantic analysis goes
beyond the assumptions of the algorithm; however, it will be developed in further
research within the framework of the Memorial project.



4 Summary 

In this paper, the problems of pre-processing and segmentation of bad quality documents
are described. Several algorithms for document localization, background separation,
segmentation and region matching are presented. All algorithms have been verified on
over 50 archival documents belonging to 4 different classes (e.g. transport lists). All
documents (even within the same class) differed in region size and structure. The
results achieved confirm the very good efficiency of the background separation
algorithm. This confirms the obvious thesis that for bad quality documents all
available information should be used, including chrominance. Good results were also
achieved for the region matching algorithm under the assumption that the template
regions cover all characters belonging to them. Otherwise the matching is fragmentary
or fails.

The high speed of the presented algorithms allows for interactive on-line work with the
processed documents. In such a case the user is able to correct region boundaries
manually. Special automatic measures could support this task by pointing out
problematic areas [2].

In spite of the good results achieved by the presented algorithms, other high-level
methods for region matching should be sought. These advanced methods, using
knowledge about the document structure, should deal with the difficult cases of region
matching. Research in that direction, including a linguistic approach [4], will be
further developed in the Memorial project.


Abstract. The study of multiple classifier systems has become an area
of intensive research in pattern recognition recently. Also in handwriting 
recognition, systems combining several classifiers have been investigated. 
Recently, new methods for the generation of multiple classifier systems, 
called ensemble methods, have been proposed in the field of machine 
learning, which generate an ensemble of classifiers from a single classifier 
automatically. In this paper a new ensemble method is proposed. It is 
characterized by training each individual classifier on a particular writing 
style. The new ensemble method is tested on two large scale handwritten 
word recognition tasks. 

Keywords: handwritten word recognition, ensemble method, writing 
style, hidden Markov model (HMM) 



1 Introduction 

The field of off-line handwriting recognition has been a topic of intensive research
for many years. First only the recognition of isolated handwritten characters was
investigated [29], but later whole words [28] were addressed. Most of the systems
reported in the literature until today consider constrained recognition problems
based on vocabularies from specific domains, e.g. the recognition of handwritten
check amounts [15] or postal addresses [17]. Free handwriting recognition,
without domain specific constraints and with large vocabularies, was addressed only
recently in a few papers [18, 24]. The recognition rate of such systems is still low,
and there is a need to improve it.

The combination of multiple classifiers was shown to be suitable for improving 
the recognition performance in difficult classification problems [20,30]. Also in 
handwriting recognition, classifier combination has been applied. Examples are 
given in [2,22,31]. Recently methods for the generation of multiple classifier 
systems, called ensemble methods, have been proposed in the field of machine 
learning, which generate an ensemble of classifiers from a single classifier [5] 
automatically. Given a single classifier, the base classifier, a set of classifiers
can be generated by changing the training set [3], the input features [14], the
input data by injecting randomness [7], or the parameters and the architecture
of the classifier [26]. Another possibility is to change the classification task from
a multi-class to many two-class problems [6]. Examples of widely used methods
that change the training set are Bagging [3] and AdaBoost [8]. The random subspace
method [14] is a well-known approach based on changing the input features. A
summary of ensemble methods is provided in [5].

Table 1. The new ensemble method

Input: The original training set T, base classifier C, and the desired number of
classifiers, n
Output: n classifiers.

A clustering algorithm is applied on T, resulting in n clusters;
for (i := 1; i <= n; i++)
    T_i contains all elements of the i-th cluster;
    classifier C is trained on T_i, resulting in classifier C_i;
return C_1, ..., C_n;

Although the popularity of multiple classifier systems in handwriting recognition
has grown significantly, not much work on the use of ensemble methods
has been reported in the literature. An exception is [25], but this paper addresses
only the classification of isolated characters, while the focus of the present paper 
is on the recognition of cursive words. Some papers of the authors addressed the 
use of ensemble methods for handwritten word recognition (e.g. [10, 12]). 

In this paper we propose a new ensemble method. This ensemble method 
specializes the individual classifiers on different writing styles. In Section 2 the 
new ensemble method is described in detail. Section 3 contains information about 
the handwritten word recognizer, which is used as the base classifier in the 
experiments. Those experiments are then discussed in Section 4 and, finally, 
conclusions are drawn in Section 5. 



2 New Ensemble Creation Method 

In this section the new ensemble method is introduced. The basic algorithm of 
the ensemble method is given in Table 1. First, the training set is divided into n 
disjoint subsets, using a clustering algorithm. Then each classifier is trained on 
the elements of a cluster. This results in n different classifiers. There are two key 
parameters of the ensemble method: the clustering algorithm and the number 
of clusters, n. The novel idea of the paper is that we want to specialize the 
classifiers on different writing styles. The intuitive motivation of this idea is that 
the models of the words may not be able to handle too many different writing 
styles and so inadequately trained models could be produced. By reducing the 
number of writing styles, i.e. by focusing on individual clusters, the quality of 
the models may increase. This is in contrast to most classical ensemble methods 
where one selects words randomly for the training of the classifiers, like Bagging 
[3], or where the selection is done according to how difficult it is for the system 
to recognize a particular word, like AdaBoost [8].

The clustering algorithm should produce clusters containing words of similar
writing style. In order to execute this clustering, some features that characterize
a particular handwriting style must be defined. As we are using the data of
the IAM database [23], we always have complete pages of handwritten text at
our disposal (compare Section 4). Therefore the features used in this paper are
extracted from a whole page, and all words of the page belong to the same
cluster. This means that in fact not the words, but the pages are clustered. The
features used for the clustering are the following:

— Words per component (WPC): The average number of words per con- 
nected component is calculated over the whole page. All components with a 
bounding box smaller than a given threshold (500 pixels in the experiments 
reported in Section 4) are ignored, as they are likely to be noise or parts of 
characters which are always separated from the main part of the character, 
e.g. the dot of the character “i”. Words consisting of single characters, like
punctuation, and numbers are also not considered. A value of one for this
feature corresponds to a complete cursive handwriting, i.e. the case where
one connected component always represents exactly one word. By contrast,
words consisting of isolated hand-printed characters with a number of characters
of n have the value 1/n for this feature. As the clustering procedure is
applied on the training set and as the ground truth of each page is known, 
the feature is simply calculated by dividing the number of words by the 
number of connected components. 

— Character width (CW): The average width of the characters is calculated
over the whole page. Again single characters are not considered and
components with a bounding box smaller than a threshold are ignored. To
calculate the width of the characters the bounding boxes of the words are 
used and so the white spaces between the words are not taken into account. 
To calculate the values of this feature the lines of each page are segmented 
into words using the procedure presented in [33] . The character width is then 
calculated as the sum of the lengths of the words of a page divided by the 
number of characters. 
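
The two page-level features can be sketched as below. The page representation is hypothetical: each word is a dictionary holding its ground-truth text, its bounding-box width in pixels and the bounding boxes (width, height) of its connected components. The threshold of 500 pixels is interpreted here as a bounding-box area, and the component filter is applied only to WPC; both are assumptions of this sketch.

```python
def page_features(words, min_box_area=500):
    """Return (WPC, CW) for one page of ground-truthed words."""
    # Single characters (including punctuation) and numbers are not considered.
    kept = [w for w in words if len(w["text"]) > 1 and not w["text"].isdigit()]
    # Small components (noise, dots of the letter i) are ignored.
    components = sum(1 for w in kept for (bw, bh) in w["boxes"]
                     if bw * bh >= min_box_area)
    characters = sum(len(w["text"]) for w in kept)
    wpc = len(kept) / components                     # words per component
    cw = sum(w["width"] for w in kept) / characters  # average character width
    return wpc, cw


page = [
    {"text": "hello", "width": 150, "boxes": [(150, 40)]},                   # cursive
    {"text": "it", "width": 50, "boxes": [(20, 40), (25, 40), (5, 5)]},      # printed
    {"text": ",", "width": 5, "boxes": [(5, 8)]},
    {"text": "42", "width": 40, "boxes": [(18, 40), (18, 40)]},
]
wpc, cw = page_features(page)
print(round(wpc, 3), round(cw, 2))
```

Here two words remain with three sizeable components and seven characters in total, giving WPC = 2/3 and CW = 200/7 pixels.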

There are three possible feature sets: each feature alone (WPC, CW) and the
feature set consisting of both features (WPC/CW). In previous experiments the
feature set CW did not produce good results, so only the other two feature sets
will be used in the experiments of this paper.

Please note that not all features that are meaningful can be used in the actual 
system. For example, slant is a very powerful feature for humans to discriminate 
different handwriting styles. Yet in our system, slant normalization is done, which 
makes the slant nearly vertical for all words. 

For clustering the words, the k-means clustering algorithm [16] with the
features described above was executed, where the features were first linearly
normalized so that the mean of the features was 0 and the standard deviation
was 1.
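
A minimal sketch of this step is given below: each feature column is standardised to zero mean and unit standard deviation, and the pages are then grouped with a plain k-means loop. The random initialisation, the fixed seed and the feature values are illustrative choices and are not taken from the paper.

```python
import math
import random


def standardise(rows):
    """Linearly rescale every feature column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def k_means(points, k, iterations=100, seed=0):
    """Return one cluster label per point."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        if new_labels == labels:              # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# (WPC, CW) per page: three cursive-like and three hand-printed-like pages.
pages = [(0.95, 30.0), (1.0, 32.0), (0.9, 31.0),
         (0.25, 45.0), (0.2, 47.0), (0.3, 46.0)]
labels = k_means(standardise(pages), k=2)
print(labels)
```

Standardising first matters because WPC lies roughly between 0 and 1 while CW is measured in pixels; without it the character width would dominate the Euclidean distance.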

When increasing the number of clusters, n, which is equal to the number of
produced classifiers, the diversity of the writing styles in a cluster decreases and
better models for those words can be built. On the other hand, a higher number
of clusters also means that fewer words are available to train the models, and
the parameters of the models are poorly estimated when the number of clusters
is too high. The optimal number of clusters, n, is a priori unknown and must be
empirically determined.
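
The ensemble creation procedure of Table 1 can be sketched generically as follows. The base classifier is replaced here by a trivial stand-in that always predicts the most frequent label of its training subset; in the paper the base classifier is the HMM recognizer of Section 3. The sample representation and all names are hypothetical.

```python
from collections import Counter, defaultdict


def build_ensemble(training_set, cluster_of, train):
    """Train one instance of the base classifier on each cluster of the data."""
    clusters = defaultdict(list)
    for sample in training_set:
        clusters[cluster_of(sample)].append(sample)
    return [train(clusters[c]) for c in sorted(clusters)]


def train_majority(samples):
    """Stand-in base classifier: always predicts the most frequent label."""
    label = Counter(lbl for _, lbl in samples).most_common(1)[0][0]
    return lambda _image: label


# Each sample is (page id, word label); pages carry the cluster assignment.
page_cluster = {"p1": 0, "p2": 0, "p3": 1}
training = [("p1", "the"), ("p2", "the"), ("p2", "of"), ("p3", "and")]
ensemble = build_ensemble(training, lambda s: page_cluster[s[0]], train_majority)
print(len(ensemble), ensemble[0](None), ensemble[1](None))
```

Because the clusters are disjoint, every training word is seen by exactly one classifier, which is the trade-off between style purity and training set size discussed above.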

3 Handwritten Word Recognizer 

The basic handwritten text recognizer used in the experiments of this paper 
is similar to the one described in [24]. It follows the classical architecture and 
consists of three main modules: the preprocessing, where noise reduction and 
normalization take place, the feature extraction, where the image of a handwrit- 
ten text is transformed into a sequence of numerical feature vectors, and the 
recognizer, which converts this sequence of feature vectors into a word class. 

The first step in the processing chain, the preprocessing, is mainly concerned 
with text image normalization. The goal of the different normalization steps 
is to produce a uniform image of the writing with fewer variations of the same
character or word across different writers. The aim of feature extraction is to 
derive a sequence of feature vectors which describe the writing in such a way 
that different characters and words can be distinguished, but avoiding redundant 
information as much as possible. In the presented system the features are based 
on geometrical measurements. At the core of the recognition procedure is an 
HMM classifier. It receives a sequence of feature vectors as input and outputs a 
word class. In the following these modules are described in greater detail. 

3.1 Preprocessing 

Each person has a different writing style with its own characteristics. This fact 
makes the recognition task complicated. To reduce variations in the handwritten 
texts as much as possible, a number of preprocessing operations are applied. The 
input for these preprocessing operations are images of words extracted from the 
database described in [23, 33]. The images were scanned with a resolution of 300 
dpi. Please note that the system is optimized for this resolution. In the presented 
system the following preprocessing steps are carried out: 

— Skew Correction: The word is horizontally aligned, i.e. rotated, such that 
the baseline is parallel to the x-axis of the image.

— Slant Correction: Applying a shear transformation, the writing’s slant is 
transformed into an upright position. 

— Line Positioning: The word's total extent in the vertical direction is normalized
to a standard value. Moreover, by applying a vertical scaling operation, the
locations of the upper and lower baselines are adjusted to a standard position.

An example of these normalization operations is shown in Fig. 1. For any 
further technical details see [24]. Please note that there is no normalization of 
the stroke width of the handwriting.

Fig. 1. Preprocessing of the images. From left to right: original, skew corrected, slant
corrected and positioned. The two horizontal lines in the rightmost picture are the
two baselines, which are automatically detected

Fig. 2. Illustration of the sliding window technique. A window is moved from left
to right and features are calculated for each position of the window. (For graphical
representation purposes, the window depicted here is wider than one pixel.)



3.2 Feature Extraction 

To extract a sequence of feature vectors from a word, a sliding window is used. 
The width of the window used in the current system is one pixel and its height 
is equal to the word’s height. The window is moved from left to right over each 
word. (Thus there is no overlap between two consecutive window positions.) 
Nine geometrical quantities are computed and used as features at each window 
position. A graphical representation of this sliding window technique is shown 
in Fig. 2. 

The first three features are the weight of the window (i.e. the number of 
black pixels), its center of gravity, and the second order moment of the window. 
This set characterizes the window from the global point of view. It includes
information about how many pixels are in which region of the window, and how
they are distributed. The other features represent additional information about
the writing. Features four and five define the position of the upper and the lower 
contour in the window. The next two features, number six and seven, give the 
orientation of the upper and the lower contour in the window by the gradient 
of the contour at the window’s position. To compute the contour direction, the 
windows to the left and to the right of the actual window are used. As feature 
number eight the number of black-white transitions in vertical direction is used. 
Finally, feature number nine gives the number of black pixels between the upper 
and lower contour. Notice that all these features can be easily computed from 
the binary image of a text line. However, to make the features robust against 
different writing styles, careful preprocessing, as described in Subsection 3.1, is 
necessary. 

To summarize, the output of the feature extraction phase is a sequence of
9-dimensional feature vectors. For each word to be recognized there exists one
such vector per pixel along the x-axis, i.e. along the horizontal extension of the
considered word.
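
The nine features might be computed per column as in the sketch below. The text names the features but does not give their exact formulas, so the definitions used here are assumptions: the centre of gravity is the mean row index of the black pixels, the second order moment is the mean squared row index, the contour gradients are central differences over the neighbouring columns, transitions are counted as value changes between vertically adjacent pixels, and an empty column places both contours at mid-height. The image is a list of rows with 1 for black.

```python
def column_features(image, x):
    """Nine geometric features of the one-pixel-wide window at column x."""
    height, width = len(image), len(image[0])

    def black_rows(cx):
        return [y for y in range(height) if image[y][cx]]

    def contour(cx):
        rows = black_rows(cx)
        # Assumption: an empty column places both contours at mid-height.
        return (rows[0], rows[-1]) if rows else (height // 2, height // 2)

    rows = black_rows(x)
    weight = len(rows)                                        # feature 1
    centre = sum(rows) / weight if weight else 0.0            # feature 2
    moment = sum(y * y for y in rows) / weight if weight else 0.0  # feature 3
    upper, lower = contour(x)                                 # features 4 and 5
    left, right = max(x - 1, 0), min(x + 1, width - 1)
    span = right - left
    upper_grad = (contour(right)[0] - contour(left)[0]) / span if span else 0.0
    lower_grad = (contour(right)[1] - contour(left)[1]) / span if span else 0.0
    column = [image[y][x] for y in range(height)]
    transitions = sum(1 for a, b in zip(column, column[1:]) if a != b)  # feature 8
    between = sum(1 for y in rows if upper < y < lower)       # feature 9
    return (weight, centre, moment, upper, lower,
            upper_grad, lower_grad, transitions, between)


word = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
]
sequence = [column_features(word, x) for x in range(len(word[0]))]
middle = sequence[1]
print(middle)
```

The resulting list holds one 9-dimensional vector per pixel column, which is the observation sequence passed to the HMMs.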



3.3 Hidden Markov Models 

Hidden Markov models (HMMs) are widely used in the field of pattern recog- 
nition. Their original application was in speech recognition [27]. But because 
of the similarities between speech and cursive handwriting recognition, HMMs 
have become very popular in handwriting recognition as well [21]. 

When using HMMs for a classification problem, an individual HMM is con- 
structed for each pattern class. For each observation sequence, i.e. for each se- 
quence of feature vectors, the likelihood that this sequence was produced by an 
HMM of a class can be calculated. The class whose HMM achieves the highest 
likelihood is considered as the class that produced the actual sequence of obser- 
vations. For further details and an introduction to HMMs see [27], for example. 

In word recognition systems with a small vocabulary, it is possible to build an 
individual HMM for each word. But for large vocabularies this method doesn’t 
work anymore because of the lack of enough training data. Therefore, in our 
system an HMM is built for each character. The use of character models allows 
us to share training data. Each instance of a character in the training set has an 
impact on the training and leads to a better parameter estimation. 

To achieve high recognition rates, the character HMMs have to be fitted to 
the problem. In particular the number of states, the possible transitions and the 
type of the output probability distributions have to be chosen. 

The number of states of the HMMs was optimized by the Quantile method
[34] for each character individually. As a result, each individual character model
has a different number of states. Because of the left to right direction of writing,
a linear transition structure has been chosen for the character models. From
each state only the same or the succeeding state can be reached. Because of the
continuous nature of the features, continuous probability distributions are used
for the features. Each feature has its own probability distribution and the likelihood
of an observation in a state is the product of the likelihoods calculated for all features.
This separation of the elements of the feature vector reduces the number of free
parameters, because no covariance terms must be calculated. The probability
distributions of all states and features are assumed to be Gaussian mixtures. The
training method of the classifier was optimized on a validation set, using an
optimization strategy described in [9].
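
The per-state observation likelihood described above (a product over features of one-dimensional Gaussian mixture densities) can be sketched as follows. The sketch works in the log domain, which is a common implementation choice rather than something stated in the paper, and all parameter values are invented.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def mixture_pdf(x, components):
    """components: list of (weight, mean, variance) with weights summing to 1."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def observation_log_likelihood(vector, state):
    """state holds one mixture per feature; features are treated independently."""
    return sum(math.log(mixture_pdf(x, mixture))
               for x, mixture in zip(vector, state))


# Two features, each modelled by a single standard normal component.
simple_state = [[(1.0, 0.0, 1.0)], [(1.0, 0.0, 1.0)]]
simple_ll = observation_log_likelihood((0.0, 0.0), simple_state)

# One feature modelled by a two-component mixture.
mixed_state = [[(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]]
mixed_ll = observation_log_likelihood((0.0,), mixed_state)
print(simple_ll, mixed_ll)
```

With d features and g mixture components per feature, a state needs on the order of 3 d g parameters (weight, mean and variance per component), whereas full covariance matrices would add terms quadratic in d.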

To model entire words, the character models are concatenated with each 
other. Thus a recognition network is obtained (see Fig. 3). Note that this network 
doesn’t include any contextual knowledge on the character level, i.e., the model 
of a character is independent of its left and right neighbor. In the network the 
best path is found with the Viterbi algorithm [27]. It corresponds to the desired 
recognition result, i.e., the best path represents the sequence of characters with 
maximum probability, given the image of the input word. The architecture shown 
in Fig. 3 makes it possible to avoid the difficult task of segmenting a word into
individual characters. More details of the handwritten text recognizer can be
found in [24].

292 Simon Gunter and Horst Bunke

Fig. 3. Concatenation of character models yields the word models

The implementation of the system is based on the Hidden Markov Model 
Toolkit (HTK), which was originally developed for speech recognition [32]. This 
software tool employs the Baum-Welch algorithm for training and the Viterbi
algorithm for recognition [27]. The output of the HMM classifier is the word 
with the highest rank among all word models together with its score value. 

4 Experiments 

For isolated character and digit recognition, a number of commonly used databases
exist. However, for the task considered in this paper, there exists only
one suitable database to the knowledge of the authors, holding a sufficiently
large number of words produced by different writers, the IAM database [23].
Consequently this database was used in the experiments. An important property
of the IAM database is that it consists of whole pages of handwriting, each
typically holding more than 50 words. This allows very stable values of the
features mentioned in Section 2 to be extracted.

Two different handwriting recognition tasks were considered: a writer 
dependent and a writer independent task. In the first task a data set of 18,920 
words was used. A total of 116 writers contributed to this set. The set of 18,920 
words was randomly split three times into a training set of 17,920 words and a 
test set of 1,000 words. Please note that the writers of the training words are 
the same as the writers of the test words, which implies that the experimental 
setup is writer dependent. The recognition rate was calculated by averaging over 
the system’s recognition rates obtained for each of the three splits. In the second 
task the whole data set of the first task is used as training set. An additional 
data set of 3,264 words, contributed by 37 writers, was used as test set. The set 
of writers of the training set and the set of writers of the test set are disjoint, so 
the experimental setup is writer independent. It should be noted that the first 
task was also used to optimize the number of states of the HMMs and to 
optimize the training method of the base classifier. As these optimization steps 
are applied to the base classifier rather than to the ensemble, they are not 
expected to favor the ensemble over the base classifier. 
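
The evaluation protocol of the first task (three random splits, averaged 
recognition rate) can be sketched as follows. The function names, the seed and 
the `evaluate` callback are assumptions made for illustration only; the source 
does not describe how the splits were drawn beyond their being random. 

```python
import random


def random_splits(words, test_size, n_splits, seed=0):
    """Yield (train, test) pairs obtained by independent random shuffles."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        shuffled = list(words)
        rng.shuffle(shuffled)
        # The first test_size items form the test set, the rest the training set.
        yield shuffled[test_size:], shuffled[:test_size]


def average_recognition_rate(words, evaluate, test_size=1000, n_splits=3, seed=0):
    """Average the rate returned by evaluate(train, test) over all splits."""
    rates = [evaluate(train, test)
             for train, test in random_splits(words, test_size, n_splits, seed)]
    return sum(rates) / len(rates)
```

With 18,920 words and `test_size=1000`, each split yields 17,920 training words 
and 1,000 test words, matching the sizes reported above. 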

To combine the individual classifiers of the ensembles the following combi- 
nation schemes were applied: 

1. Maximum score (score): Only the top choice of each classifier is considered. 
The result of the classifier with the highest score value is used as the result 
of the classifier ensemble. 

2. Voting scheme (voting): Again, only the top choice of each classifier is 
considered. The word class that is most often on the first rank is the output of 
the combined classifier. Ties are broken by means of the average rule (ave), 
the maximum rule (max), the minimum rule (min) or the median rule (med), 
which are applied only to the competing word classes. These four rules 
decide for the word class which has the highest average, maximum, minimum 
or median score, respectively. 

3. Special combination scheme for HMM-based recognizers (special): This com- 
bination was introduced in [11]. It integrates all HMMs of the different clas- 
sifiers that correspond to the same character into a single HMM. 
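
The first two schemes can be sketched in a few lines. The sketch assumes that 
each classifier delivers its top choice as a (word, score) pair and that a 
higher score is better; the function names are hypothetical. The special scheme 
operates on the HMMs themselves and is not covered here. 

```python
from collections import Counter
from statistics import mean, median


def combine_score(outputs):
    """Maximum score: return the word of the classifier with the highest score."""
    return max(outputs, key=lambda o: o[1])[0]


def combine_voting(outputs, tie_rule="ave"):
    """Voting over top choices; ties are broken by the ave/max/min/med rule."""
    votes = Counter(word for word, _ in outputs)
    top = max(votes.values())
    tied = [word for word, count in votes.items() if count == top]
    if len(tied) == 1:
        return tied[0]
    rule = {"ave": mean, "max": max, "min": min, "med": median}[tie_rule]
    # The rule is applied only to the scores of the competing word classes.
    return max(tied, key=lambda w: rule([s for x, s in outputs if x == w]))
```

For example, with the outputs `[("cat", -10.0), ("car", -12.0), ("cat", -20.0), 
("car", -11.0)]` the vote is tied two to two, so the chosen rule decides the 
result. 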

The results obtained for the first task are shown in Table 2. The recognition 
rate of the base classifier was 80.03 %. The columns feature sets and n denote 
the underlying feature set and the number of classifiers, respectively. The best 
result of each feature set is marked in bold. Please note that all four versions 
of the voting scheme are the same for n = 2 and n = 3. Therefore the results 
of only one version are shown. In addition the score combination scheme is the 
same as the voting scheme for n = 2, so only the results of the score scheme are 
depicted. 

The results of the special combination scheme are clearly superior to those 
of the other schemes. In 7 out of 8 cases the best recognition rate was obtained 
using this scheme. Both the special and the score combination schemes were 
consistently superior to the voting scheme. Using the special scheme the 
performance of the base classifier was increased by 2.84 % for {WPG} and 
3.13 % for {WPC,CW}. It seems that both feature sets are well suited for the 
new ensemble method. The optimal number of classifiers is 3 for the considered 
application. Please note that this number is rather small. The reason for the 
small optimal value of n is that the base classifier contains a high number of 
free parameters, which are poorly estimated when using more than four clusters. 
The increase of the performance is not based on the higher number of HMM 
parameters, as the number of parameters of the base classifier was (implicitly) 
optimized by optimizing the number of states and the number of Gaussians per 
Gaussian mixture. 
Table 2. Results of the new ensemble method for the first task. The performance of 
the base classifier is 80.03 %. The number of clusters is given in the column n 

                       feature sets 
n    combination    {WPG}      {WPC,CW} 
2    score          81.7 %     81.73 % 
2    special        82.4 %     82 % 
3    score          82.6 %     82.63 % 
3    special        82.56 %    83.17 % 
3    voting         82.03 %    81.83 % 
4    score          82.63 %    82.1 % 
4    special        82.87 %    82.87 % 
4    voting ave     81.73 %    79.97 % 
4    voting max     81.63 %    80.06 % 
4    voting min     81.63 %    79.67 % 
4    voting med     81.77 %    79.67 % 
5    score          81.8 %     82.33 % 
5    special        82.43 %    82.67 % 
5    voting ave     80.5 %     80.57 % 
5    voting max     80.6 %     80.6 % 
5    voting min     80.47 %    80.47 % 
5    voting med     80.47 %    80.47 % 



Table 3. Results of the ensemble method for the second task. The performance of the 
base classifier is 80.48 %. The number of clusters is given in the column cn 

feature set    cn    performance 
{WPG}          3     80.48 % 
{WPG}          4     80.73 % 
{WPC,CW}       3     81.07 % 
{WPC,CW}       4     80.64 % 



For the second task only the parameter values which produced good results 
in the first task were used. Therefore only ensembles of 3 or 4 classifiers 
were tested, and the ensembles were always combined by the special 
scheme. Table 3 shows the results of the ensemble method for the second task. 
The recognition rate of the base classifier, which is trained on the full training 
set, was 80.48 %. 

The performance of the ensemble method was always better than or the same as 
the performance of the base classifier. The obtained increases in the recognition 
rate are rather low when compared to the results in Table 2. Please note that 
the best configuration in Table 2, i.e. 3 clusters and feature set {WPC,CW}, 
also achieved the best result in Table 3. With these parameter values a 
performance increase of 0.59 % with respect to the base classifier has been 
obtained. This increase is statistically significant using a significance level 
of 15 %¹. 

¹ Where the null hypothesis is that both the base classifier and the ensemble have the 
same performance. 
The good results of the first and the modest results of the second task 
indicate that the approach works best when the writer of the test word already 
contributed data to the training set, i.e. the system already “knows” the writing 
style of the test word. 

5 Conclusions 

A new ensemble method was introduced where each classifier is specialized on 
a different writing style. Two different handwritten word recognition tasks were 
considered. In the first task a writer dependent setup was used where the writers 
of the tested words also contributed words to the training set. The second task 
is writer independent, i.e. the writers of the test words did not contribute 
words to the training set. The ensemble method achieved very good results for 
the writer dependent setup. The performance of the base classifier was increased 
by up to 3.13 %. The performance gain for the writer independent setup was 
modest, but the ensemble method still always did better than the base classifier 
or had the same performance. In the best case an improvement of 0.59 % was 
achieved. It seems that the new ensemble method is particularly good for writer 
dependent setups. 

Future research will focus on the use of other feature sets for the clustering 
of the training elements. For example some of the features introduced in [13] 
may be used. Additionally, the use of different values for the HMM and training 
parameters of the individual classifiers may be investigated. Such parameters are 
for example the number of states and the training method. The values of those 
parameters were optimized for the full training set and may be suboptimal when 
using smaller training sets. 
Abstract. The design and performance of a content-based information 
retrieval system for handwritten documents are described. System indexing 
and retrieval is based on writer characteristics, textual content as well 
as document meta data such as writer profile. Documents are indexed 
using global image features, e.g., stroke width, slant, word gaps, as well 
as local features that describe shapes of characters and words. Image 
indexing is done automatically using page analysis, page segmentation, line 
separation, word segmentation and recognition of characters and words. 
Several types of queries are permitted: (i) entire document image; (ii) a 
region of interest (ROI) of a document; (iii) a word image; and (iv) 
textual. Retrieval is based on a probabilistic model of information retrieval. 
The system has been implemented using Microsoft Visual C++ and a 
relational database system. This paper reports on the performance of the 
system for retrieving documents based on same and different content. 



1 Introduction 

Methods for indexing and retrieval of scanned handwritten documents are needed 
for various applications such as historical manuscripts, scientific notes, personal 
records, as well as criminal records. In each of these applications there is a 
need for indexing and retrieval based on textual content as well as user-indexed 
terms. In the forensic application there is a need for searching a database of 
handwritten documents not only for textual content but also for visual content 
such as writer characteristics. This paper describes the indexing and retrieval 
aspects of a system that attempts to provide the full range of functionalities for 
a digital library of handwritten documents. 

Writer identification has a long history, perhaps dating to the origins of 
handwriting itself. Classic forensic handwriting examination is primarily based 
upon the knowledge and experience of the forensic expert. There exist many 
textbooks [1-5] describing the methodology employed by forensic document 
examiners. A computer system for retrieving handwritten documents from a set 
of documents of forensic interest, known as the Forensic Information System for 
Handwriting, or the FISH system [6], has been developed by German law 
enforcement. Also motivated by the forensic application, a handwritten document 
management system, known as CEDAR-FOX [7, 8], has been developed at CEDAR, 
whose indexing and retrieval aspects are the subject of this paper. 

1.1 CEDAR-FOX System 

As a document management system for handwritten documents, CEDAR-FOX 
provides several functionalities: interactive document analysis, image indexing 
to create a digital library for content-based retrieval, and use as a database 
management system. For the purpose of indexing based on writer characteristics, 
features are automatically extracted after several image processing functions 
followed by character recognition. The user can use interactive graphical tools 
to assist in obtaining more accurate writer characteristics. A unique aspect of 
CEDAR-FOX is that image matching is driven by probabilistic writer 
verification, i.e., whether the query and the document were likely to be written 
by the same writer. 

Several database management tools for creating a handwritten document 
library are provided: (i) entering document meta-data, e.g., identification 
number, writer and other collateral information, (ii) creating a textual 
transcript of the image content at the word level, and (iii) including 
automatically extracted document level features, e.g., stroke width, slant, word 
gaps, as well as finer features that capture the structural characteristics of 
characters and words. The system can be customized to use any commercial or 
non-commercial database system for the digital library storage. It also provides 
access and retrieval functionalities for adding, modifying and categorizing the 
document records in the digital library. 

Information retrieval can be performed using several query modalities: (i) 
the entire document image is the query; (ii) partial image query: a region of 
interest (ROI) of a document or a word image; (iii) text: the user can type in 
keywords from the words in the documents, and (iv) meta-data: case number, 
person names, time and the pre-registered keywords such as brief descriptions 
of the case. 

1.2 Organization of Paper 

Section 2 describes the indexing aspects of CEDAR-FOX which has two parts: 
image features and meta data. Section 3 describes the retrieval aspects of the 
system. Section 4 shows the performance of the system with the original micro 
and macro features as well as with a new set of cognitive features. Concluding 
remarks are presented in Section 5. 

2 Indexing 

Indexing of handwritten document images does not only involve textual 
information like in the IR counterpart but includes document/writer features and 
also meta data regarding the origin of the document, writer profile, type of 
writing instrument used, etc. Section 2.1 describes the image features and their 
use in document retrieval. 

2.1 Image Features 

Image features are computed at the document and the character levels. The 
document level features are called macro features whereas the character level 
features are called the micro features. The initial 12 macro features reported in 
[7] were truly at the document level, meaning they were computed either directly 
from the entire document (no. of black pixels, threshold, etc.) or on a line-by- 
line basis and applied to the entire document (no. of exterior contours, no. of 
interior contours, slope, etc.). We report new macro features that can be used 
to index images which are holistic - they are computed at the character level 
and applied to the entire document and normalized. The micro features are the 
character GSC features which are based on the character’s gradient, structure 
and concavity properties. 



Micro-features. The micro-features used in CEDAR-FOX consist of 512 bits 
corresponding to gradient (192 bits), structural (192 bits), and concavity (128 
bits) features. Each of these three sets of features relies on dividing the scanned 
image of the character into a 4 x 4 grid of regions. The gradient features capture 
the stroke flow orientation and its variations using the frequency of the gradient 
directions, as obtained by convolving the image with a Sobel edge operator, in 
each of 12 directions and then thresholding the resultant values to yield a 
192-bit vector. The structural features, representing the coarser shape of the 
character, capture the presence of corners, diagonal lines, and vertical and 
horizontal lines in the gradient image, as determined by 12 rules. The concavity 
features capture the major topological and geometrical features including 
direction of bays, presence of holes, and large vertical and horizontal strokes. 
All 512 binary features are converted from the original floating-point 
computations. 



Macro-features. Macro-features in CEDAR-FOX [7] represent the entire 
document. The current implementation of CEDAR-FOX uses three sets of 
features: darkness, contour and averaged line-level features. The darkness 
features, in turn, consist of three features all obtained from the histogram of 
the gray-scale values in the scanned document image: the number of black pixels 
in the image, the gray-scale value corresponding to the valley in the histogram 
that separates the foreground pixels from the background pixels (known as the 
threshold) and the entropy of the histogram (which is a measure of uncertainty 
in the distribution). The contour features, six in number, are as follows: the 
number of components and holes (as measured by the number of interior and 
exterior contours in the chain-code outline of the handwriting), and slopes in 
the vertical, negative, positive and horizontal directions. The averaged line-level 
features consist of average slant and height of characters. 
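
The three darkness features can be illustrated with a short sketch. The source 
does not state how the histogram valley is located; Otsu's method is used here 
as an assumed stand-in, and all function names are hypothetical. Pixels are 
taken to be integer gray values in the range 0 to 255, with dark ink at the low 
end. 

```python
import math


def gray_histogram(pixels, levels=256):
    """Count how often each gray value occurs."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    return hist


def otsu_threshold(hist):
    """Gray value t maximizing the between-class variance (classes <= t and > t)."""
    total = sum(hist)
    weighted_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (weighted_sum - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def histogram_entropy(hist):
    """Shannon entropy (in bits) of the normalized histogram."""
    total = sum(hist)
    return -sum((h / total) * math.log2(h / total) for h in hist if h > 0)


def darkness_features(pixels):
    """Return (number of black pixels, threshold, entropy)."""
    hist = gray_histogram(pixels)
    threshold = otsu_threshold(hist)
    black = sum(hist[: threshold + 1])  # pixels at or below the threshold
    return black, threshold, histogram_entropy(hist)
```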
We also develop a new set of cognitive document-level features that capture: 
(i) the legibility of a handwritten document at character and word level, (ii) the 
relative proportions existing between the characters, and (iii) the distribution of 
writing-styles (lexemes) existing in the handwriting of a writer. Some of these 
features as well as some of the original macro features are illustrated in Fig. 1. 




Fig. 1. Macro features used in current and future version of CEDAR-FOX 



2.2 Meta-data 

Storing indexes for handwritten documents also involves use of data related to 
the origin of the document, document identification number, style of writing 
(cursive, handprint) etc., making the task of retrieval much easier and more 
flexible for the user. For the use of questioned document examiners, who are 
potential users of such a retrieval system, we decided on the following set of 
meta-data that describe each document. 

1. Case Number - The case with which the document is associated 

2. Date of Writing - When the document was written 

3. Date of Input - Automatically obtained from the system time 

4. Writing instrument - Ball-point pen, magic marker, etc. 

5. Geographical Area - Area associated with the case or where the document 
came from 

6. Handedness - Handedness of writer, if known 

7. Keywords - Any keywords the user wants to associate with the document 
e.g. Extreme, Threat, Kidnap, etc. 

8. Suspect’s First & Last Names 

9. Victim’s First and Last Names 

10. Transcript - Transcript for the document obtained either by truthing or a 
text transcript 

11. Path - Path of the file originally stored on disk 

12. Image - The scanned document image itself. 
The user can store the meta-data using our dialog interface to the database, 
which is activated when a document is open, as shown in Fig. 2. The 
feature-based indexes are computed when the document is opened and are 
automatically stored into the database along with the meta-data. Thus the 
feature extraction process, which is an online computation task, is done only 
once and its results are stored in the database. The final representation of the 
document is thus only the indexes associated with it, which include the 
meta-data, the features and the textual information associated with the 
document’s transcript. It is to be noted that in contrast with IR, the indexes 
here are content (textual) as well as cognitively-based (FDE features). The 
database provides the necessary functionality to do string and numeral based 
comparisons. 




Fig. 2. Dialog interface for database: allows user to enter meta-data for a document 



3 Retrieval 

Related to the issue of retrieving documents from a database that are relevant 
to the query supplied is the need for associating a quantitative measure of 
similarity between two samples. For the task of writer identification, the goal 
is to take the document as a query to compare with some or all of the document 
data in the database. The matching between the query document and each 
document in the database is performed using many attributes of the documents. 
The matching may be done using the micro features, macro features or their 
combination. 



3.1 Micro-feature Similarity 



To measure the similarity between two characters whose shapes are represented 
using binary vectors we use the Correlation measure [9]. Given $S_{ij}$ 
($i, j \in \{0, 1\}$), the number of occurrences of matches with $i$ in the first 
feature vector and $j$ in the second feature vector at the corresponding 
positions, the dissimilarity $D$ between the two feature vectors $X$ and $Y$ is 
given by the formula: 

$$D(X, Y) = \frac{1}{2} - \frac{S_{11} S_{00} - S_{10} S_{01}}{2 \sqrt{(S_{10} + S_{11})(S_{01} + S_{00})(S_{11} + S_{01})(S_{00} + S_{10})}} \qquad (1)$$
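
A minimal sketch of the correlation dissimilarity for two equal-length binary 
vectors is given below, computing 1/2 minus the normalized difference 
S11*S00 - S10*S01 over twice the square root of the product of the four 
marginal sums. The function name is hypothetical, and returning 0.5 when the 
denominator vanishes (one vector is constant) is an assumed convention that 
the source does not specify. 

```python
import math


def correlation_dissimilarity(x, y):
    """Correlation-based dissimilarity in [0, 1] between two binary vectors."""
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, b in zip(x, y):
        counts[(a, b)] += 1  # counts[(i, j)]: positions with i in x and j in y
    s11, s00 = counts[(1, 1)], counts[(0, 0)]
    s10, s01 = counts[(1, 0)], counts[(0, 1)]
    denom = math.sqrt((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10))
    if denom == 0:
        return 0.5  # assumed convention: treat as uncorrelated
    return 0.5 - (s11 * s00 - s10 * s01) / (2 * denom)
```

Identical vectors give a dissimilarity of 0, complementary vectors give 1, and 
uncorrelated vectors give 0.5. 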



The distributions of distances in both the same-writer and different-writer 
categories follow a univariate Gaussian density. During training the parameters 
of these distributions are estimated: the mean $\mu_{SW}$ and variance 
$\sigma^2_{SW}$ for the same-writer category and the mean $\mu_{DW}$ and 
variance $\sigma^2_{DW}$ for the different-writer category. They define the two 
densities $p_{SW}(x)$ and $p_{DW}(x)$. 

During matching, for each character $c_i$, $i = 1, \dots, N$, where $N$ is the 
number of characters considered, we compute the distance $d_i^j$ between the 
two samples of pair $j$ for that character. We can have $j = 1, \dots, M_i$ 
possible pairs of samples for a given character. For characters $c_i$, 
$i = 1, \dots, N$, and $M_i$ pairs of samples for each character we estimate the 
log-likelihood ratio: 

$$LLR(micro) = \ln \left( \frac{\prod_{i,j} p_{SW}(d_i^j)}{\prod_{i,j} p_{DW}(d_i^j)} \right) \qquad (2)$$

If $LLR(micro) > 0$, we have a same-writer decision; if not, we have a 
different-writer decision. 
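
The decision rule can be sketched as a sum of per-pair log-density differences, 
which equals the logarithm of the ratio of the two products. The sketch assumes 
a single pair of Gaussian densities shared by all characters, each given as a 
(mean, variance) tuple; the function names and the example parameter values in 
the usage note are hypothetical. 

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log of the univariate Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def log_likelihood_ratio(distances, same_writer, diff_writer):
    """Sum over all sample pairs of log p_SW(d) - log p_DW(d)."""
    return sum(gaussian_log_pdf(d, *same_writer) - gaussian_log_pdf(d, *diff_writer)
               for d in distances)


def same_writer_decision(distances, same_writer, diff_writer):
    """True if the log-likelihood ratio is positive."""
    return log_likelihood_ratio(distances, same_writer, diff_writer) > 0
```

For instance, with assumed parameters `same_writer = (0.2, 0.01)` and 
`diff_writer = (0.5, 0.01)`, small distances such as `[0.2, 0.25]` lead to a 
same-writer decision, while `[0.5, 0.45]` lead to a different-writer decision. 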

To match documents based on micro-features, it is necessary to recognize 
characters, either manually or automatically, so that matching can be performed 
between the same characters. Character sizes are estimated knowing the average 
height of text lines and word information. The estimates are used to filter 
connected components, so that those of appropriate size are candidate 
characters. In the case of cursive writing, touching characters are separated 
using a word recognizer. The word recognizer takes the word image and its text 
transcription and segments the image into characters before sending them to the 
character recognizer. Word transcription is made in one of two ways: the user 
types the content of each word during document registration using an interface, 
or the automatic transcript mapping functionality is used. The latter allows a 
pre-typed transcript to be read in automatically and the content of each word 
to be matched with the corresponding word image automatically. Other geometric 
information is obtained at the document image processing stage. 
3.2 Macro-feature Similarity 

The macro features are mapped into a distance vector of differences. The 
distance distributions in both the same-writer and different-writer categories 
using the macro-features follow the same univariate Gaussian density of the 
form we used above for the micro-features. As in the training and testing with 
micro-features, the distribution coefficients are computed. The verification 
decision for any pair of documents using macro-features is also made using 
formula (2) above, resulting in $LLR(macro)$, where the distances are measured 
using absolute differences for each of the real-valued macro features. The 
degree of match between two documents is measured using the total LLR score as 
follows: 

$$LLR = LLR(micro) + LLR(macro) \qquad (3)$$

This approach to similarity measurement is similar to the probabilistic model 
of information retrieval [10]. In this model, the LLR gives a relevance/similarity 
measurement for document retrieval. 

As a document retrieval system, CEDAR-FOX creates a handwritten 
document database through its rich set of tools. For each handwritten document 
image, the system collects all the related information including the document 
image itself, the features for matching, the region of interest (ROI) that is 
selected manually by the user and all the possible meta-data. For collecting the 
information, the system provides a rich set of interactive tools for the user to 
specify any local details in the document image. Among the tools, there is an 
easy data entry function with which the user can type in a transcript of the 
document to match the image, an easy-access tool for fixing automatic 
segmentation problems by merging or splitting word images, and a modification 
tool for fixing any character recognition problem. 



3.3 Query Methods 

Based on Meta-data. Meta-data are text data such as identification number, 
writer and other collateral information. In real forensic applications, there are 
often text data related to a handwritten document image. They include the 
time and date the document was collected, descriptions of the case, keywords 
for efficient text search and a registration number as identification. Other 
useful information can be the possible linkage to any known case, known 
document and the author of the document. The system provides easy data entry 
tools for adding the meta-data that the user types in to the database tables. 
The meta-data will then be stored as a record corresponding to the document. 
Thus a document in terms of a database entity can be considered as a single 
record or a ‘tuple’. 

CEDAR-FOX provides efficient retrieval from such a database. Several query 
modalities are permitted for retrieval. The database functionality has been 
implemented using MySQL, the database management system by MySQL AB, 
and the interface to the database is through the MySQL libraries. From the 
user point of view, a graphical user interface has been provided which takes as 
input the fields on which the user wants to query the database of documents. 
This query is mainly based on textual information and meta-data. The user can 
query the database in order to retrieve documents relevant to the meta-data and 
textual information used in the query. 
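
A meta-data query of this kind can be sketched with a small relational table. 
The sketch below uses Python's built-in sqlite3 module as a stand-in for the 
MySQL back end named above, and the table name, column names and sample values 
are hypothetical; the actual CEDAR-FOX schema is not given in the source. 

```python
import sqlite3


def build_library():
    """Create an in-memory library with one record ('tuple') per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE document (
               id INTEGER PRIMARY KEY,
               case_number TEXT,
               writing_instrument TEXT,
               keywords TEXT,
               path TEXT)"""
    )
    return conn


def add_document(conn, case_number, writing_instrument, keywords, path):
    """Store the meta-data of one document as a single record."""
    conn.execute(
        "INSERT INTO document (case_number, writing_instrument, keywords, path) "
        "VALUES (?, ?, ?, ?)",
        (case_number, writing_instrument, keywords, path),
    )


def find_by_keyword(conn, keyword):
    """Return (case_number, path) of all documents whose keywords contain keyword."""
    cursor = conn.execute(
        "SELECT case_number, path FROM document WHERE keywords LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()
```

Parameterized queries are used so that text typed by the user is never 
interpolated directly into the SQL statement. 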



Based on Document Features. Another type of querying is based on the 
features of the documents and this is where identification comes into play. The 
process of identification is nothing but querying in which the query consists of 
a document. Unlike the IR counterpart, where the query has to be considered 
as a pseudo-document for which similar relevant documents are retrieved, here 
the query is a real document and the task of this information retrieval system 
is to use the similarity measurements to pull out documents that it considers 
relevant. In effect, when we give a document as a query, we expect the system to 
return only those documents that it considers to be written by the same writer 
who wrote the query document. The relevance or similarity of the retrieved 
document is measured using the similarity metrics presented above in equation 
(3). The log likelihood ratio is used to decide similarity, just as in the 
probabilistic model of information retrieval. When the query involves a document 
rather than textual information entered by the user, there are three possible 
options the user can use to define what part of the document to use to query the 
database. 

Document Level: The Entire Document Image Is the Query. When the system 
loads a document image, it can be directly used as the query. For identifying 
the document from the database, the automatically extracted features are used 
for the matching. The query returns a ranked list of documents in the library. 
The score attached to each document is computed using as much of the available 
information for the query document as possible (Fig. 3). 




Fig. 3. Query result showing a ranked list of documents 



Partial Image: A Region of Interest (ROI) of a Document. A document image 
may include many text or graphical objects. The user often needs to specify a 
local region of greatest interest. Using a system cropping tool, the user can 
easily crop a rectangular region and use the ROI as the query. 

Word Level Image: Any Word in the Document. Any word from the query 
document can be cropped and compared against other documents in the database. 
The result will be a ranking of the words in other documents that are most 
similar to the query. The original document containing the retrieved words can 
easily be obtained from the database. 

Based on Text Keyword. The user can type in any keywords ranging from 
the words in the documents, case number, person names, time and the pre- 
registered keywords such as brief descriptions of the case. The text identification 
is done by matching between the query text words and the text in the digital 
library. The matching considers the priority of the information represented by 
the words. The distance measure is edit distance based. 
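
An edit-distance-based keyword match can be sketched as follows. The standard 
Levenshtein distance with unit costs is assumed, as is case-insensitive 
comparison; the source does not specify the cost model or how the priority of 
the information enters the ranking, so that part is omitted. Function names are 
hypothetical. 

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,        # delete char_a
                               current[j - 1] + 1,     # insert char_b
                               previous[j - 1] + (char_a != char_b)))  # substitute
        previous = current
    return previous[-1]


def rank_keywords(query, library_words):
    """Sort library words by edit distance to the query (case-insensitive)."""
    return sorted(library_words,
                  key=lambda w: (edit_distance(query.lower(), w.lower()), w))
```

With this ranking, a slightly misspelled query such as "thret" still places the 
keyword "Threat" ahead of unrelated keywords. 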

4 Results and Analysis 

4.1 Experimental Framework 

The document image set consists of 3000 samples written by 1000 writers, where 
each writer wrote three samples of a preformatted letter. From each document, 
images of numerals and alphabetic characters were extracted and represented as 
512-bit binary vectors. The distance between two character images is given by a 
real-valued similarity distance between the corresponding feature vectors. 
Real-valued features (macro) were also extracted for the documents. 

The writer identification performance is evaluated for two scenarios: (i) same- 
content - the documents being compared have the same text content and (ii) 
different-content scenario - the documents being compared have different text 
content. This distinction is important, since in most cases the documents to be 
compared are presumed to have different content. 



Same Content. In this scenario, the 3000 document set is divided into a train- 
ing set of 2000 documents and a test set of 1000 documents. Therefore, each 
writer has two documents in the train set and one document in the test set. 



Different Content. For each of the 3000 documents, an imaginary center line 
is computed and used to split the document into an upper and lower half. The 
features described before are extracted from both halves. Two similar sets of 
2000 and 1000 document images are built by randomly selecting half-images for 
each writer and assigning them to the sets. In the comparison process we ensure 
that only different halves are being compared. Some of the old macro features 
(entropy, threshold, number of black pixels and average height) could not be 
computed for this scenario and were simply not taken into consideration in the 
final combination mix. 

For both scenarios we have used a weighted version of the k-nn classifier for 
identification. 
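
The source does not specify the weighting used in the k-nn classifier. The 
sketch below assumes inverse-distance weighting, which is one common choice, 
and takes as input the distances from a query document to the training 
documents together with their writer labels; all names are hypothetical. 

```python
from collections import defaultdict


def weighted_knn(labelled_distances, k=3, eps=1e-9):
    """Identify the writer by an inverse-distance weighted vote of the k nearest.

    labelled_distances: iterable of (writer_id, distance) pairs, one per
    training document. eps avoids division by zero for identical samples.
    """
    nearest = sorted(labelled_distances, key=lambda pair: pair[1])[:k]
    votes = defaultdict(float)
    for writer, distance in nearest:
        votes[writer] += 1.0 / (distance + eps)
    return max(votes, key=votes.get)
```

With this weighting, a single very close training document can outweigh two 
more distant documents of another writer, which plain majority voting would not 
allow. 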
4.2 Results 

The document management system CEDAR-FOX on which these experiments 
are conducted is able to perform document retrieval at the document image, 
document word image or text keyword level. Document image retrieval, for which 
the experimental results presented here have been obtained, provides the user 
with an automatic and efficient way to identify a questioned document from a 
large set of known documents in the database. Each document image in the test 
set is compared with every document image in the train set. The comparison is 
done in the feature space, using some or all of the previously described features. 

We have estimated both the individual and accumulated writer identification 
performance: (i) individual performance of each feature, (ii) accumulated 
performance of previously proposed features, newly proposed ones and the 
character-level features (extracted from all 62 characters for the same content 
scenario and from the common characters only in the different content scenario). 
We previously reported results for 12 macro features in [7] and showed the 
performance to be 99% for 2 writers and about 59% for 900 writers. Along with 10 
character features the identification performance was shown to be 87%. 

The individual and accumulated writer identification performance results ob- 
tained using the newly proposed features are presented in Table 1. As expected, 
the same-content performance is much better than the different-content perfor- 
mance. We note that to obtain these values we identify the writer of the top 
retrieved document image as our result. If we retrieve the top 5 or top 10 doc- 
ument images (out of the set of 2000) or we combine these features with others 
the performance becomes significantly better (almost 70%). 



Table 1. Individual and accumulated writer identification performance of proposed 
features for same and different content 

                              Identification 
Features                      Same Content    Different Content 
Lexeme                        16.36           3.98 
Character                     9.01            7.05 
Word Legibility               1.33            0.20 
Inter-Character Distance      0.72            0.41 
Char Relative Height          3.48            0.41 
Char Relative Slant           2.56            1.02 
Proposed Features             35.62           11.34 



Fig. 5 presents the retrieval results for accumulated features in the same-content
and different-content scenarios. The first plot displays the precision-recall
curves obtained for 50 features (old + new + numerals + some lower-case characters)
when varying the number of top images retrieved from 1 to 20. For the second plot
we keep the number of top retrieved images fixed (top 10) and vary the number of
features used (1-50). As we can see, the writer identification performance in the
same-content case reaches about 90% when the top result is returned. These results
clearly show the usefulness of our system in querying the large document collection
to return one or more documents that closely match the given query document.

Fig. 4. Precision-recall curves obtained for same-content/different-content
scenarios for (a) fixed number of features (50) and variable number of top choices
considered (1-50) and (b) variable number of features (1-50) and fixed number of
top choices considered (15)

We can also observe that: (i) the performance of the proposed features is 
highly dependent on the type of comparison scenario (same-content vs. different- 
content); (ii) while individually some features perform much better than others, 
each one brings its own contribution to a superior accumulated performance. 

5 Conclusion 

We have described the information retrieval aspects of a document analysis and
management system for handwritten documents. Writer characteristics as well as
document content and metadata can be used for retrieval. The performance of the
system using a large set of document and character-level features has been
presented.


Abstract. In a country like India, a single text line of most official documents
contains words in two different scripts. Under the two-language formula, Indian
documents are written in English and the state official language. For Optical
Character Recognition (OCR) of such a document page, it is necessary to separate
the words of different scripts before feeding them to the OCRs of the individual
scripts. In this paper a robust technique is proposed for word-wise script
identification from Indian doublet form documents. First, the document is
segmented into lines and then the lines are segmented into words. Using different
topological and structural features (like number of loops, headline feature, water
reservoir concept based features, profile features, etc.) individual script words
are identified from the documents. The proposed scheme is tested on 24210 words of
different doublets and we obtained more than 97% accuracy, on average.

Keywords: Script Identification, Indian script, Bangla script, Malayalam script,
Gujarati script, Devnagari script, Telugu script, Multi-script OCR.



1 Introduction 

In India there are 19 languages, and 12 scripts are used for these languages. In a
country like India, a single text line of most official documents contains words in
two different scripts. Under the two-language formula, Indian documents are written
in English and the state official language. For Optical Character Recognition (OCR)
of such a document page, it is necessary to separate the words of different scripts
before feeding them to the OCRs of the individual scripts [2]. In this paper a
robust technique is proposed for word-wise script identification from Indian
doublet form documents. Here, we consider five major Indian doublet documents.
These doublets are {Devnagari, English}, {Bangla, English}, {Malayalam, English},
{Telugu, English}, and {Gujarati, English}.

Among the earlier pieces of work on script line separation, Spitz [7] described a
method for classification of individual text from a document containing English
(Roman) and Japanese text. Later, Spitz [8] developed a method to separate Han
based and Latin based scripts. He used optical density distribution of characters
and frequently occurring word shape characteristics for the purpose. Using cluster
based templates, an automatic script identification technique has been described
by Hochberg et al. [4]. Wood et al. [10] described an approach using filtered pixel
projection profiles for script separation. Ding et al. [3] proposed a method for
separating two classes of scripts: European (comprising Roman and Cyrillic scripts)
and Oriental (comprising Chinese, Japanese and Korean scripts). Recently, using
fractal-based texture features, Tan [9] described an automatic method for
identification of Chinese, English, Greek, Russian, Malayalam and Persian text.
Previously, we have developed an automatic scheme for text line identification
from documents containing Indian scripts [5]. Script characteristics, shape based
features and statistical features are used for the purpose. All the above pieces
of work deal with line-wise script identification.

In this paper a robust technique is proposed for word-wise script identification 
from Indian documents. In the proposed method, at first, the document is segmented 
into lines and then the lines are segmented into words. Using different topological and 
structural features (e.g. number of loops, headline feature, profile features, water res- 
ervoir concept based features like reservoir height and width, water flow direction, 
reservoir position, etc.) individual script words are identified from the documents. 

The organization of the paper is as follows. In Section 2 properties of five major
Indian scripts are discussed. We do not describe the properties of English since
they are well known. Pre-processing steps like text digitization, noise cleaning,
and line and word segmentation are described in Section 3. The features used in
the identification scheme are discussed in Section 4. The text word separation
scheme is described in Section 5. Finally, experimental results and discussions
are provided in Section 6.

2 Properties of Five Different Indian Scripts 

Devnagari is the most popular script in India. It has 12 vowels and 33 consonants.
They are called basic characters. Vowels can be written as independent letters, or
by using a variety of diacritical marks which are written above, below, before or
after the consonant they belong to. When vowels are written in this way they are
known as modifiers. Sometimes two or more consonants can combine and take new
shapes. These new shape clusters are known as compound characters [1]. These types
of basic characters, compound characters and modifiers are present not only in
Devnagari but also in the other four Indian scripts, though not in English. Bangla
script has many similarities with Devnagari script. Modern Bangla script has 10
vowels and 32 consonants. In both the Bangla and Devnagari alphabets many
characters have a horizontal line at the upper part, which is known as Shirorekha
or Headline [1]. No English character has such a characteristic, so it can be taken
as a distinguishing feature to separate English from both Bangla and Devnagari.
Malayalam script has 13 vowels and 37 consonants. Most of the characters in this
script have a convex curve type shape at their left or right end or both. During
the second half of the 20th century a new written Telugu script evolved based on
the modern spoken language. Telugu script has 14 vowels and 34 consonants. Most of
the characters in Telugu script have a ‘tick’ like part in their upper region,
which is a distinct characteristic of this script. The Gujarati script was adopted
from the Devnagari script. That is why this script has much similarity with
Devnagari, except that characters in Gujarati script have no Shirorekha type
feature. Gujarati script has 11 vowels and 32 consonants. All the basic characters
in these five scripts are shown in Fig. 1. A text word in these six scripts can be
partitioned into three zones. The upper zone denotes the portion above the
headline, the middle zone covers the portion of basic (and compound) characters
below the headline and the lower zone is the portion where some of the modifiers
can reside. The imaginary line separating the middle and lower zones is called the
base line.

Fig. 1. Basic characters of five Indian scripts.



3 Preprocessing 

The images are digitized by an HP scanner at 300 DPI. The digitized images are in
gray tone and we have used a histogram based thresholding approach to convert them
into two-tone images (0 and 1). The digitized image may contain spurious noise
pixels and irregularities on the boundary of the characters, leading to undesired
effects on the system. A median filtering technique is applied here to reduce the
noise. The lines of a text block are segmented by finding the valleys of the
horizontal projection profile computed by a row-wise sum of pixel values. The
position where the profile height is the least denotes one boundary line [1]. A
text line can be found between two consecutive boundary lines. By finding the
valleys of the vertical projection profile, a text line is segmented into words. If
the width of a valley is greater than 2*W, we assume that the valley is the
separator of two words. W is computed in the following way. We compute,
horizontally, the mode of the white runs that occur between two consecutive black
runs of a line. This mode is W and it generally represents the distance between two
characters in a word. From the experiment we noticed that the distance between two
words is greater than or equal to twice the distance between two characters in a
word. Hence, the threshold value for word segmentation is taken as 2*W.
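
The word segmentation rule above can be illustrated with a short sketch. It assumes a text line given as a list of rows of 0/1 values (1 for black); the function names are hypothetical, and a valley is treated as a word gap when its width is at least 2*W, which follows the "greater than or equal to" wording of the observation.

```python
from collections import Counter

def char_gap_mode(line):
    """Mode W of the lengths of white runs lying between two black runs in a row."""
    runs = Counter()
    for row in line:
        gap, seen_black = 0, False
        for px in row:
            if px:                        # black pixel
                if seen_black and gap:
                    runs[gap] += 1        # a white run just ended at a black run
                seen_black, gap = True, 0
            elif seen_black:
                gap += 1                  # white pixel after some black pixel
    return runs.most_common(1)[0][0] if runs else 0

def split_words(line):
    """Return (first_column, last_column) spans of the words in a binary text line."""
    width = len(line[0])
    # Vertical projection profile: number of black pixels in each column.
    profile = [sum(row[c] for row in line) for c in range(width)]
    threshold = max(2 * char_gap_mode(line), 1)
    words, start, last_ink = [], None, None
    for col, ink in enumerate(profile):
        if not ink:
            continue
        if start is None:
            start = col
        elif col - last_ink - 1 >= threshold:   # valley wide enough: word boundary
            words.append((start, last_ink))
            start = col
        last_ink = col
    if start is not None:
        words.append((start, last_ink))
    return words

row = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]
line = [row, row]
print(char_gap_mode(line))   # 1
print(split_words(line))     # [(0, 4), (9, 12)]
```

In the toy line the one-pixel gaps are the most frequent, so W = 1 and only the four-pixel valley is taken as a word separator.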



4 Features Used for Script Identification 

The features are chosen with the following considerations: (a) presence in
characters of some scripts and absence in characters of at least one script,
(b) robustness, accuracy and simplicity of detection, (c) speed of computation and
(d) independence of font, size and style of the text. Some of the principal
features used in the scheme are:

Shirorekha feature: If we take the longest horizontal run of black pixels on the
rows of a text line then the length of such a run in Bangla and Devnagari scripts
will be much greater than that in English script. This is because characters in a
word are connected by the head-line in Bangla as well as in Devnagari script. For
illustration, see Fig. 2. Here the row-wise maximum run is shown in the right part
of the words. This run information has been used to separate Bangla and Devnagari
scripts from English. For the other scripts this feature cannot be used to separate
them from English. We say the head-line feature exists in a word if one of the
following two conditions is satisfied: (a) the length of the longest run is greater
than 70% of the width of the word, or (b) the length of the longest run is greater
than 2 times the height of the middle zone.
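
The two head-line conditions can be checked directly on a binary word image. The sketch below is an illustration under assumptions: the word is a list of rows of 0/1 values, the function names are hypothetical, and the middle-zone height is passed in because zone detection is not shown here.

```python
def longest_black_run(word):
    """Length of the longest horizontal run of black (1) pixels over all rows."""
    best = 0
    for row in word:
        run = 0
        for px in row:
            run = run + 1 if px else 0
            best = max(best, run)
    return best

def has_headline(word, middle_zone_height):
    """Head-line test: run > 70% of word width, or run > 2 x middle-zone height."""
    run = longest_black_run(word)
    width = len(word[0])
    return run > 0.7 * width or run > 2 * middle_zone_height

# A word whose top row is an almost continuous line, and one with short runs only.
headline_word = [[1] * 8 + [0, 0],
                 [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]]
plain_word = [[0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
              [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]]

print(has_headline(headline_word, middle_zone_height=5))  # True
print(has_headline(plain_word, middle_zone_height=2))     # False
```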

Distribution of vertical stroke feature: One of the most distinctive and inherent
characteristics of most English characters is the existence of a vertical line-like
stroke at the leftmost part, or at both the left and right, of the character. This
stroke can be computed by measuring the vertical run of black pixels at the
leftmost part of the character. The character is said to have a vertical stroke if
the length of a black pixel run is at least 70% of the character height. If in a
particular word 40% of the characters satisfy this feature then the word is treated
as English. If the text is written in italic style this vertical stroke detection
method may not work, so we use a profile based method for italic vertical line
detection. We compute the left/right profile of a character and observe the
behavior of the profile in the region between the mean-line and the base line. If
all left/right profiles of this region behave uniformly (either all increasing or
all decreasing) we decide that a vertical line exists in that character, as shown
in Fig. 3. Here all the left profiles in the character are decreasing from top to
bottom. Hence, we assume that an italic vertical line exists at the leftmost part
of the character.
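
For upright (non-italic) text, the vertical stroke test and the 40% word-level rule can be sketched as follows. The width of the "leftmost part" is not specified in the text, so the `band` parameter (leftmost quarter of the character) is an assumption, as are the function names and the toy character images.

```python
def longest_vertical_run(char, col):
    """Longest run of black (1) pixels in column `col` of a character image."""
    best = run = 0
    for row in char:
        run = run + 1 if row[col] else 0
        best = max(best, run)
    return best

def has_left_stroke(char, band=0.25):
    """True if some column in the leftmost band has a run >= 70% of the height."""
    height, width = len(char), len(char[0])
    cols = range(max(1, int(width * band)))
    return any(longest_vertical_run(char, c) >= 0.7 * height for c in cols)

def is_english_by_stroke(chars):
    """Word-level rule: at least 40% of the characters show the stroke."""
    hits = sum(has_left_stroke(c) for c in chars)
    return bool(chars) and hits >= 0.4 * len(chars)

L_SHAPE = [[1, 0, 0, 0]] * 4 + [[1, 1, 1, 1]]
O_SHAPE = [[0, 1, 1, 0],
           [1, 0, 0, 1],
           [1, 0, 0, 1],
           [1, 0, 0, 1],
           [0, 1, 1, 0]]

print(has_left_stroke(L_SHAPE))                          # True
print(has_left_stroke(O_SHAPE))                          # False
print(is_english_by_stroke([L_SHAPE, O_SHAPE]))          # True  (1 of 2)
print(is_english_by_stroke([L_SHAPE, O_SHAPE, O_SHAPE])) # False (1 of 3)
```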

Water reservoir principle based feature: The water reservoir principle is as
follows. If water is poured from one side of a component, the cavity regions of the
component where water will be stored are considered as reservoirs [6]. By top
(bottom) reservoirs we mean the reservoirs obtained when water is poured from the
top (bottom) of the component. (A bottom reservoir of a component is visualized as
a top reservoir when water is poured from the top after rotating the component by
180°.) Similarly, if water is poured from the left (right) side of the component,
the cavity regions of the component where water will be stored are considered as
left (right) reservoirs. All these reservoirs are shown in Fig. 4 for the English
character ‘X’.

Fig. 2. Row-wise longest horizontal run is shown in five different script words.

Fig. 3. Vertical line detection approach for italics character.

This water reservoir feature can be used as a useful tool to distinguish different
script words. To identify English script words from other script words we use the
vertical stroke feature in most cases, because this feature exists in most English
characters. But there are some English words in which many characters with a
vertical line may not occur. To identify such English words from other script words
we use water reservoir concept based features. In English it is found that
characters without a vertical line have left or right reservoirs, or both. If a
character has both left and right reservoirs then the reservoir with the greater
height is taken. If the sum of this reservoir height and the stroke width of the
character exceeds 70% of the character width then that character is marked. If, for
a particular word, 40% of the characters are marked then the word is identified as
English. For illustration see Fig. 5. The word shown in Fig. 5 has seven
characters, and four of them satisfy this reservoir property. Hence this word is
treated as English. This feature is used to extract English from doublets like
{Bangla, English}, {Devnagari, English}, {Malayalam, English}, {Gujarati, English}
and {Telugu, English}.

Fig. 4. Water flow level of reservoir is shown by

Fig. 5. Reservoir height and stroke width

The water reservoir feature is also used to identify Malayalam words from the
{Malayalam, English} doublet. To identify Malayalam words the water flow direction
of the reservoir is used. There are characters in Malayalam which have top
reservoirs where water flows from ‘Right to Left’, as shown in Fig. 6. Such
characters occur frequently in Malayalam, but this kind of feature is absent in
English characters. So if within a word at least one character has this feature
then the word containing the character is identified as Malayalam. The character
shown in Fig. 6 is ‘Pa’ in the Malayalam alphabet.

Word-Wise Script Identification from Indian Documents 315

There are Malayalam characters which have a top reservoir whose height is almost
equal to the height of the character and which also have a loop situated at the
left portion of the reservoir. An example of a character with this property is
shown in Fig. 7. This is ‘La’ in the Malayalam alphabet. Characters with this
property are totally absent in English. So if in a particular word at least one
such character exists then that word is treated as Malayalam. These two reservoir
based features are used to separate Malayalam words from English words.




Fig. 6. Malayalam character contains top reservoir with water flow direction from
‘Right to Left’.

Fig. 7. Malayalam character contains loop within the left part of the top
reservoir.



The reservoir based feature is also used to extract Gujarati from the {Gujarati,
English} doublet. There are Gujarati words with a single character, as shown in
Fig. 8. This character is ‘chha’ in the Gujarati alphabet. This type of character
has a top reservoir whose height is greater than 80% of the character length. As
this type of single character word is totally absent in English, the existence of
such a character distinguishes the Gujarati word from English.

The water reservoir concept based feature is also used to extract some Telugu 
words from English words. If water reservoir is considered from left side of a charac- 
ter then in Telugu script several characters can be found with left reservoir, whereas 
in English only five characters (a, s, x, y, z) have left reservoirs. Among them ‘s’, ‘x’ 
and ‘z’ have both left and right reservoirs. So if left reservoir is found for a particular 
character in a word then that character is also tested for right reservoir. If the charac- 
ter has no right reservoir then that character is marked. In a particular word if the 
existence of such characters is 20% or more then the word is treated as Telugu. 

Shift feature below Shirorekha: There are some Bangla/Devnagari words which may not
satisfy the Shirorekha feature although some characters in the word may have
head-lines. For them, first, each character of the word is segmented and the
leftmost black pixel of the head-line of each segmented character is noted. Let
this pixel be x_1. Next, the head-line portion of each segmented character is
deleted and, after deletion of the head-line, the first black pixel from the top is
noted. Let this pixel be x_2. If the distance between x_1 and x_2 is greater than
or equal to 50% of the width of the character then that character is noted. If the
number of such characters in a word is 20% or more then the word is treated as
Bangla in the case of the {Bangla, English} doublet and Devnagari in the case of
the {Devnagari, English} doublet, because this type of feature is almost absent in
English. For illustration see Fig. 9, which shows a Bangla word. Though this word
has Shirorekha on some characters, as a whole it does not satisfy the Shirorekha
feature; it does satisfy the shift feature below Shirorekha and is therefore
treated as Bangla.




Fig. 8. Gujarati character with single top reservoir.

Fig. 9. Shift feature below Shirorekha shown in a Bangla word.



Left and right profile: Suppose each character is located within a rectangular bound- 
ary, a frame. The horizontal or vertical distances from any one side of the frame to the 
character edge are a group of parallel lines which we call the profile (see Fig. 10). If 
we compute left and right profile of the characters in a word, we can notice some 
distinct difference between Malayalam and English scripts. For example, in both the 
left and right profiles of Malayalam script, most of the characters have one transition 
point because of their convex shape and this feature is rare in English. By transition 
we mean change of the profiles from increasing mode to decreasing mode or vice- 
versa. We use left and right profile feature for the identification of Malayalam script. 
Left and right profile of a Malayalam character is shown in Fig. 10. 

There are many Telugu characters which have right profile with a single transition 
point, from decreasing to increasing. But in case of English this type of behavior is 
almost absent. So this profile based feature is used in {Telugu, English} doublet to 
extract Telugu characters from English. In a particular word if at least 20% of the 
characters satisfy this feature then the word is identified as Telugu. 
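
The profile and transition definitions above can be made concrete with a small sketch. It assumes a character given as a list of rows of 0/1 values cropped to its bounding frame, reads profiles from top to bottom, and ignores flat steps when counting transitions; these choices and all function names are assumptions made for illustration.

```python
def left_profile(char):
    """Per row, distance from the left frame edge to the first black (1) pixel."""
    width = len(char[0])
    return [row.index(1) if 1 in row else width for row in char]

def right_profile(char):
    """Per row, distance from the right frame edge to the last black (1) pixel."""
    width = len(char[0])
    return [row[::-1].index(1) if 1 in row else width for row in char]

def slope_signs(profile):
    """Signs (+1 / -1) of successive changes in the profile, flat steps dropped."""
    signs = [(b > a) - (b < a) for a, b in zip(profile, profile[1:])]
    return [s for s in signs if s]

def transitions(profile):
    """Number of switches between increasing and decreasing behavior."""
    signs = slope_signs(profile)
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def is_telugu_by_right_profile(chars, threshold=0.2):
    """Word-level rule: >= 20% of characters have a right profile with a single
    transition, from decreasing to increasing."""
    def dips_once(profile):
        return transitions(profile) == 1 and slope_signs(profile)[0] < 0
    hits = sum(dips_once(right_profile(c)) for c in chars)
    return bool(chars) and hits >= threshold * len(chars)

ROUND = [[0, 0, 1, 1, 0, 0],
         [0, 1, 0, 0, 1, 0],
         [1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0]]

print(left_profile(ROUND))               # [2, 1, 0, 1, 2]
print(transitions(left_profile(ROUND)))  # 1
print(is_telugu_by_right_profile([ROUND]))
```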

Deviation feature: A character is first tested for a right vertical line. If it has
a right vertical line then the topmost and bottommost row values of that vertical
line are noted. Then, from the left bottommost part of the vertical line, an
anticlockwise traversal is performed up to 30% of the character length. During this
traversal the leftmost column value is calculated for each row and the distances
between the leftmost columns of consecutive rows are noted. If one of these
differences is greater than 1.5 times the stroke width then the row value
corresponding to that difference is stored. If this row value is greater than 30%
of the component length and the difference between this row value and the
bottommost row value is greater than 1.5 times the stroke width then this character
is marked, provided that the character has only one vertical line and that it is a
right vertical line. The definition of a vertical line was discussed earlier. If
the total number of such characters is greater than 20% of the total characters in
the word then the word is treated as Gujarati. This particular feature is almost
absent in English and so it is used to separate English from Gujarati words.
Fig. 11 shows this particular feature.

Fig. 10. Left and right profile of a character is shown.

Fig. 11. Deviation feature of a character is shown.



Loop feature: A loop is defined as the hollow region enclosed by black pixels in a
particular character. The size of this hollow region may vary from a single white
pixel up to an area comparable to the area enclosed by the character itself. There
are characters in some Indian scripts like Malayalam which have more than two loops
of significant size. So, this feature can be used to distinguish English characters
from Malayalam characters. Although there are many characters in Malayalam with
more than two loops, there exists no English character which has more than two
loops. But if two consecutive characters touch then sometimes in English a
component may have more than two loops. To overcome this connectivity problem it is
first checked whether the width of the character concerned is smaller than twice
the character height. If it is so and the character has more than two loops, then
the word containing that character is considered as Malayalam in the case of the
{Malayalam, English} doublet. One such Malayalam character is shown in Fig. 12.
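
A loop count of this kind can be obtained by flood filling the background. The sketch below is one possible way to do it, not necessarily the one used in the system: it assumes a character given as a list of rows of 0/1 values, treats the background as 4-connected, and uses hypothetical function names.

```python
def count_loops(char):
    """Count white regions fully enclosed by black pixels (4-connected background)."""
    h, w = len(char), len(char[0])
    # Pad with a white border so that all outside background forms one region.
    grid = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in char] + [[0] * (w + 2)]
    seen = [[False] * (w + 2) for _ in range(h + 2)]

    def fill(r0, c0):
        stack = [(r0, c0)]
        seen[r0][c0] = True
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < h + 2 and 0 <= nc < w + 2
                if inside and not seen[nr][nc] and grid[nr][nc] == 0:
                    seen[nr][nc] = True
                    stack.append((nr, nc))

    fill(0, 0)                       # mark the outside background
    loops = 0
    for r in range(h + 2):
        for c in range(w + 2):
            if grid[r][c] == 0 and not seen[r][c]:
                loops += 1           # an unvisited white region is an enclosed hole
                fill(r, c)
    return loops

def malayalam_by_loops(char):
    """Rule from the text: width < 2 x height and more than two loops."""
    h, w = len(char), len(char[0])
    return w < 2 * h and count_loops(char) > 2

BAR, HOLE = [1, 1, 1, 1], [1, 0, 0, 1]
TWO_LOOPS = [BAR, HOLE, BAR, HOLE, BAR]
THREE_LOOPS = [BAR, HOLE, BAR, HOLE, BAR, HOLE, BAR]

print(count_loops(TWO_LOOPS), malayalam_by_loops(TWO_LOOPS))      # 2 False
print(count_loops(THREE_LOOPS), malayalam_by_loops(THREE_LOOPS))  # 3 True
```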

Tick feature: In most of the characters in Telugu script there is a distinct “tick”
like feature called Telakattu at the top of the character. This feature is very
helpful to identify Telugu script from English script because no such “tick”
appears in English characters. A Telugu character having a “tick” feature is shown
in Fig. 13. To identify this “tick” feature in a character, it is first checked
whether there is a top reservoir with left flow level at the upper part of the
character. If two or more such reservoirs exist then the smallest and uppermost one
is considered. The water flow level of the reservoir is noted; let the water flow
point be (X, Y). Starting from the base point (x, y) of the reservoir, the boundary
up to the topmost point of the character is traced in a clockwise manner. (By base
point we mean the lowermost point of the reservoir.) Let the number of these traced
points be N and let these points be (x_i, y_i), i = 1, ..., N. Similarly, from the
base point of the reservoir an anti-clockwise movement is performed along the
boundary of the character up to the water flow point (X, Y). Let the number of
these traced points be M and let these points be (x'_i, y'_i), i = 1, ..., M. For a
character we say a “tick” feature exists if the following three conditions are
satisfied: (a) N is greater than M, (b) (y_1 - y) = (y_2 - y) = ... = (y_N - y),
and (c) (y - y'_1) = (y - y'_2) = ... = (y - y'_M).

Fig. 12. Malayalam character containing more than two loops.

Fig. 13. Tick feature detection approach in a Telugu character.



Left inclination feature: There are some words with two or three characters in 
Bangla which do not satisfy the head-line feature or shift feature below Shirorekha 
but occur frequently in various sentences. For these words a new technique is pro- 
posed. It is found that most of these words contain some special type of characters 
which have a tendency of right to left inclination at the lower part of the character, as 
shown in Fig. 14. This Bangla character (named ‘e’ ) has this type of characteristic as 
shown in the diagram. This type of characteristic can be obtained as follows: 

The lowermost part of each character is scanned left to right from the bottom. The
first pixel encountered in this way is stored. Starting from this pixel a clockwise
traversal is performed until half of the character length is reached. During the
traversal all the visited pixels are stored. If these pixels are all in decreasing
manner and the length of this set of pixels, i.e. the height of this inclined
portion, is greater than 60% of half of the component length then it is said that a
left inclination exists in the bottom portion of the character. If in a particular
word one or more characters satisfy this feature then the word is treated as
Bangla. This feature is shown for the Bangla character ‘e’ in Fig. 14.

In Fig. 14 a Bangla word is shown which has only two characters. This particular
word does not satisfy the head-line (Shirorekha) feature or the shift feature below
Shirorekha. For this type of word the left inclination feature is used. Here the
Bangla character ‘e’ satisfies the left inclination feature, as shown in Fig. 14,
and hence the word is treated as Bangla.



5 Script Identification Techniques 

To extract English words from different Indian scripts a tree classifier is
designed. In this classifier the user first manually gives a choice (input) for a
particular doublet. Based on this input the control moves to the subtree
corresponding to that doublet for classification. The subtree corresponding to each
doublet is discussed separately below; because of the page limitation of this
workshop, we present the identification scheme only briefly.

Fig. 14. Left inclined portion is shown in a Bangla character.

{Bangla, English} Doublet: The primary feature used in the subtree for this doublet
is the “Shirorekha” feature, because it is noted that the probability of occurrence
of a head-line or Shirorekha in a Bangla word is 0.997. So it is justified to use
“Shirorekha” at the top of the tree classifier. If this feature is satisfied the
word is identified as a Bangla word. Otherwise the word is checked for the shift
feature. If the shift feature is satisfied the word is treated as Bangla; if not,
the word is tested for the left inclination feature. If that is satisfied the word
is marked as Bangla; otherwise the tests for English words are applied. The first
test is based on the vertical stroke feature and the second on the water reservoir
concept based feature. If the word satisfies either of these two tests it is
identified as English; otherwise it is left as confused.
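
The order of tests in this subtree can be written as a short cascade. In the sketch below the feature detectors are passed in as predicates; the dictionary keys, the function name and the stub detectors in the usage example are all hypothetical stand-ins for the features of Section 4.

```python
BANGLA_TESTS = ("shirorekha", "shift_below_shirorekha", "left_inclination")
ENGLISH_TESTS = ("vertical_stroke", "water_reservoir")

def classify_bangla_english(word, tests):
    """Apply the subtree for the {Bangla, English} doublet.

    `tests` maps a feature name to a predicate taking the word and returning
    True when the feature is satisfied (hypothetical detectors).
    """
    for name in BANGLA_TESTS:          # Bangla features are tried first, in order
        if tests[name](word):
            return "Bangla"
    for name in ENGLISH_TESTS:         # then the two English features
        if tests[name](word):
            return "English"
    return "confused"                  # no feature fired

# Stub detectors: each word is a dict of precomputed feature flags.
stub_tests = {name: (lambda word, n=name: word.get(n, False))
              for name in BANGLA_TESTS + ENGLISH_TESTS}

print(classify_bangla_english({"shirorekha": True}, stub_tests))       # Bangla
print(classify_bangla_english({"left_inclination": True,
                               "vertical_stroke": True}, stub_tests))  # Bangla
print(classify_bangla_english({"water_reservoir": True}, stub_tests))  # English
print(classify_bangla_english({}, stub_tests))                         # confused
```

Because the Bangla tests come first, a word that satisfies both a Bangla feature and an English feature is labeled Bangla, which matches the ordering described in the text.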

{Devnagari, English} Doublet: A subtree similar to the one used for the {Bangla,
English} doublet is also used for the {Devnagari, English} doublet, due to the
similarity in script characteristics between Bangla and Devnagari scripts.

{Malayalam, English} Doublet: In the subtree corresponding to this doublet the
convexity or profile based feature is taken as the primary feature, because it is
observed that most Malayalam characters have the left or right profile (or both)
discussed earlier. So it is justified to take this feature as the initial feature.
If the word satisfies the profile based feature then it is identified as a
Malayalam word; otherwise it is checked for the reservoir based features discussed
earlier. If a reservoir feature is satisfied it is treated as Malayalam; otherwise
it is checked whether any character of it has more than two loops. If such a
feature exists then it is Malayalam. Otherwise, the vertical line feature is tested
on the word. If a vertical line like feature exists then the word is English. Else,
the water reservoir based feature is tested on it and if this feature exists the
word is English. Otherwise, the word is left as confused.

{Gujarati, English} Doublet: As most of the Gujarati characters satisfy the
deviation feature, it is taken as the principal feature to distinguish between
English and Gujarati words in the subtree. Words satisfying the deviation feature
are identified as Gujarati words. Words not satisfying the deviation feature are
tested for a special feature known as the upper zone feature. It is observed that
most Gujarati words have a modified character at the upper part of the word. The
length of the modified character is greater than that of any of the dots which
occur frequently in the upper part of English words. So if a modified character
exists in the upper part of a given word whose length is greater than half of the
component length then that word is treated as Gujarati. The words not satisfying
this feature are checked by their size. If it is found that the word consists of a
single character then the water reservoir concept based feature for Gujarati
characters is applied to it. If this feature is satisfied the word is treated as
Gujarati. Otherwise the tests for English characters are applied. The vertical line
feature and water reservoir concept based features are used for the purpose. If the
word satisfies any of these features then it is treated as English, otherwise it is
treated as confused.

320 S. Sinha, U. Pal, and B.B. Chaudhuri

{Telugu, English} Doublet: As most of the Telugu characters have the ‘tick’
feature, it is justified to use this feature as the principal feature in the
subtree. If this feature is satisfied for a word then it is identified as Telugu;
otherwise the left reservoir feature is used for identification. If the left
reservoir feature is satisfied, the word is identified as Telugu. Otherwise, the
right profile based feature is used. If this feature exists then the word is
identified as Telugu. If not, the tests for English words begin. If the word
satisfies either of the two features (the vertical line based feature or the water
reservoir based feature) then it is treated as English; otherwise it is left as
confused.



6 Result and Discussion 

For the experiments, two data sets were collected from different documents such as question
papers, bank account opening application forms, money order forms, computer printouts,
translation books, dictionaries, etc. Some of the data captured from dictionaries and
money order application forms were of inferior quality. A data set of 7500 items was used
for training and the other data set of 24210 items was used for testing the proposed
schemes. We noted that the accuracy rates of the script word separation schemes for the five
doublets {Devnagari, English}, {Bangla, English}, {Malayalam, English}, {Gujarati,
English} and {Telugu, English} were 97.14%, 98.30%, 98.89%, 97.22% and 98.06%,
respectively. These statistics are computed from 5196, 5244, 5189, 5073 and 3508
script words of the {Devnagari, English}, {Bangla, English}, {Malayalam, English},
{Gujarati, English} and {Telugu, English} doublet documents, respectively. The overall
accuracy of the proposed system was about 97.92%.
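
As a quick arithmetic check, the reported overall accuracy agrees with a mean of the per-doublet accuracies weighted by the number of test words. The paper does not say how the overall figure was computed, so the weighting is an assumption of this sketch.

```python
# Per-doublet accuracy (%) and number of test words, as reported in the text.
accuracy = {
    "Devnagari": 97.14,
    "Bangla": 98.30,
    "Malayalam": 98.89,
    "Gujarati": 97.22,
    "Telugu": 98.06,
}
words = {
    "Devnagari": 5196,
    "Bangla": 5244,
    "Malayalam": 5189,
    "Gujarati": 5073,
    "Telugu": 3508,
}

total_words = sum(words.values())
# Weighted mean: each doublet contributes in proportion to its word count.
overall = sum(accuracy[k] * words[k] for k in words) / total_words

print(total_words)        # matches the size of the test set
print(round(overall, 2))  # weighted overall accuracy in percent
```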

From the experiment we obtained the highest accuracy for the {Malayalam, English} doublet.
This is because most of the Malayalam characters have a profile based feature
and a vertical line like feature does not exist at the side of the characters. Also, from
the experiment we noticed that most of the errors are generated from poor and noisy
documents, which are mostly broken after digitization.

The confusion rates of the five doublets are 1.92%, 3.69%, 1.25%, 5.03% and
4.36%, respectively. Also, we observed that most of the errors come from small
words with three or fewer characters.

This scheme does not depend on the size of characters in the text line. Also, we noticed
that this approach is font and case insensitive. The use of simple features, which
are easy to compute, makes our system fast. The average execution time for an A4 size
document on a P-IV machine is about 11 seconds.

The work presented here is a step towards building a multi-lingual OCR system
that can work for all major Indian scripts. The next step is to build OCR modules for
each individual script. To this end, successful modules for Bangla and Devnagari have
already been demonstrated [2]. We encourage researchers to develop OCR technologies
for other Indian scripts so that computer-based Indian document technology
reaches maturity in the near future.
Abstract. Annotations on digital documents have clear advantages over annotations
on paper. They can be archived, shared, searched, and easily manipulated.
Freeform digital ink annotations add the flexibility and natural expressiveness
of pen and paper, but sacrifice some of the structure inherent to annotations created
with mouse and keyboard. For instance, current ink annotation systems do
not anchor the ink so that it can be logically reflowed as the document is resized
or edited. If digital ink annotations do not reflow to keep up with the portions of
the document they are annotating, the ink can become meaningless or even misleading.
In this paper, we describe an approach to recognizing digital ink annotations
to infer this structure, restoring the strengths of more structured digital
annotations to a preferable freeform medium. Our solution is easily extensible
to support new annotation types and allows us to efficiently resolve ambiguities
between different annotation elements in real-time.



Introduction 

While the vision of the paperless office remains a distant hope, many technologies 
including high-resolution displays, advances in digital typography, and the rapid pro- 
liferation of networked information systems are contributing to a better electronic 
reading experience for users. One important area of enabling research is digital docu- 
ment annotation. Digital annotations persist across document versions and can be 
easily searched, shared, and analyzed in ways that paper annotations cannot. 

Many digital annotation systems employ a user interface in which the user selects a 
portion of the document and a post-it-like annotation object is anchored at that point, 
as shown in Fig 1(a). The user enters text into the post-it by typing on the keyboard. 
Later, as the document is edited, the post-it reflows with the anchor. While this 
method is widely used among commercial applications, it is a cumbersome user inter- 
face. Consequently, many users choose to print out their documents and mark them up 
with a pen on paper, losing the benefits of digital annotations in the process. 

A user interface in which users sketch their annotations in freeform digital ink 
(Fig 1(b)) on a tablet-like reading appliance (Fig 1(c)) overcomes some of these limi- 
tations. By mimicking the form and feel of paper on a computer, this method stream- 
lines the user interface and allows the user to focus on the reading task. For instance, 
in describing their xLibris system, Schilit et al introduce the term active reading, a 
form of reading in which critical thinking, learning, and synthesis of the material 
results in document annotation and note-taking. By allowing users to mark directly on 
the page they add “convenience, immersion in the document context, and visual 
search.” [1] 

Fig. 1. (a) Digital text annotated and edited with "formal" annotations, and (b) equivalently
with informal, freeform annotations. (c) A tablet-like computer with pen to annotate documents
with digital ink.



Ink Annotation Recognition 

In this paper, we describe a technique for recognizing freeform digital ink annotations 
created using a paper-like annotation interface on a Tablet PC. Annotation recognition 
includes grouping digital ink strokes into annotations, classifying annotations into one 
of a number of types, and anchoring those annotations to an appropriate portion of the 
underlying document. For example, a line drawn under several words of text might be 
classified as an underline and anchored to the words it is underlining. We describe the 
full set of annotation types and anchoring relationships we wish to support in the 
System Design section. 

There are several reasons why we wish to recognize digital ink annotations, includ- 
ing annotation reflow, automatic beautification, and attributing the ink with actionable 
editing behaviors. 

Our primary goal is to reflow digital ink, as shown in Fig 2(a) and (b). Unlike their 
physical counterparts, digital documents are editable and viewable on different de- 
vices. Consequently the document layout may change. If digital ink annotations do 
not reflow to keep up with the portions of the document they are annotating, the ink 
can become meaningless or even misleading. Recognizing, anchoring, and reflowing 
digital ink annotations can avoid this disastrous outcome. Golovchinsky and Denoue 
first observed this problem in [2], but we have observed that the simple heuristics they 
report are not robust to a large number of real-world annotations and they do not pro- 
pose a framework in which to incorporate new types of annotations. 

A second goal of recognition is to automatically beautify the annotations, as shown 
in Fig 2(c). While freeform inking is a convenient input medium, Bargeron reports 
that document authors prefer a stylized annotation when reading through comments 
made by others [3]. 

A third goal for recognizing digital ink annotations is to make the annotations ac- 
tionable. Many annotations convey desired changes to the document, such as “delete 
these words” or “insert this text here.” The Chicago Manual of Style [4] defines a 
standard set of editing symbols. By automatically recognizing annotations, we can 
add these behaviors to the ink to further streamline the editing process. 

Fulfilling these goals in a system is a broad task that incorporates many facets
other than recognition. There are user interface issues such as when and how to show
the recognition results and how to correct those results. There are software architecture
issues such as how to properly integrate such functionality into a real text editor.
There are other algorithmic issues such as how to reflow the ink strokes. However,
after building a full working system we have found it useful to separate the annotation
recognition as a well-encapsulated software component. In this paper we describe that
component in detail, including its architecture, algorithm, and implementation.

Fig. 2. Annotation reflow and cleaning. (a) Original user annotations (b) are properly reflowed
as the document is edited and then (c) cleaned by the system based on its automatic
interpretation.



Paper Organization 

The paper is organized as follows. We begin with a problem statement: the realm of 
possible annotation types and document contexts is extremely large, and in the Sys- 
tem Design section we first describe and justify the subset we have chosen to recog- 
nize. We then give an overview of the system architecture to introduce our approach. 
We have chosen a recognition approach in which multiple detectors offer competing 
hypotheses, and we resolve those hypotheses efficiently via a dynamic programming 
optimization. Next, we evaluate the approach both quantitatively and qualitatively on 
a set of files. Finally we conclude and outline future work. 



System Design 

In order to support the application features described in the previous section, includ- 
ing reflow, beautification, and actioning, we have designed a software component to 
segment, classify, and anchor annotations within a document context. In this section 
we describe the design of that component. We have scaled back our problem to han- 
dle a fixed vocabulary of annotation types: horizontal range, vertical range, container, 
connector, symbol, writing, and drawing. In this section, we define each of these an- 
notation types, define the document context that is required to perform recognition, 
and justify this restricted approach. 

Annotation Types

While the set of all possible annotations is no doubt unbounded, certain common 
annotations such as underlines and highlights immediately come to mind. To define a 
basic set of annotations, we refer to the work of Brush and Marshall [5], which indi- 
cates that in addition to margin notes, a small set of annotations (underline / highlight 
/ container) are predominantly used in practice. We have found that it is useful to 
further divide the category of margin notes into writing and drawings for the purposes 
of text search and reflow behavior. Thus we pose the problem of annotation recogni- 
tion as the classification and anchoring of horizontal range, vertical range, container, 
callout connector, symbol, writing, and drawing annotations. We illustrate examples 
of these types with a sample annotated document shown in Fig 3. 




Fig. 3. Common annotation types. Horizontal range (red), vertical range (green), container 
(orange), callout connector (blue), symbol (magenta), and writing (purple) and drawing (cyan) 
marginalia. 



Document Context 

Annotation is a common activity across a wide variety of documents including text 
documents, presentation slides, spreadsheets, maps, floor plans, and even video (e.g., 
weathermen and sports commentators). While it is impossible to build an annotation 
recognizer that spans every possible document, it is desirable to abstract away the 
problem so that its solution can be applied to a number of common document types. 

Defining this appropriate abstraction for document context is difficult: it is unlikely 
that any simple definition will satisfy all application needs. Nevertheless, we define a 
document context as a tree structure that starts at the page. The page contains zero or 
more text blocks and zero or more graphics objects. Text blocks contain one or more 
paragraphs, which contain one or more lines, which contain one or more words. Each 
of these regions is abstracted by its bounding box (Fig 4). At this point we do not 
analyze the underlying text of the document: this has not been necessary and makes 
our solution language-independent. 

This definition of context is rich enough to support a wide variety of documents,
including, but not limited to, word processing documents, slide presentations,
spreadsheets, and web pages.
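
The tree-structured document context described above can be sketched with plain data classes. The class and field names here are illustrative assumptions, not the authors' implementation; the only property taken from the text is the nesting page → text blocks and graphics → paragraphs → lines → words, with each region reduced to a bounding box.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class Word:
    bbox: BBox


@dataclass
class Line:
    bbox: BBox
    words: List[Word]


@dataclass
class Paragraph:
    bbox: BBox
    lines: List[Line]


@dataclass
class TextBlock:
    bbox: BBox
    paragraphs: List[Paragraph]


@dataclass
class Graphic:
    bbox: BBox


@dataclass
class Page:
    bbox: BBox
    blocks: List[TextBlock] = field(default_factory=list)
    graphics: List[Graphic] = field(default_factory=list)


def union(boxes: List[BBox]) -> BBox:
    """Smallest box that contains every box in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def iter_words(page: Page) -> Iterator[Word]:
    """Walk the tree from the page down to the words."""
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for line in paragraph.lines:
                yield from line.words


# Build a one-line page bottom-up; parent boxes are unions of child boxes.
words = [Word((10, 10, 40, 20)), Word((45, 10, 90, 20))]
line = Line(union([w.bbox for w in words]), words)
paragraph = Paragraph(line.bbox, [line])
block = TextBlock(paragraph.bbox, [paragraph])
page = Page((0, 0, 200, 300), [block])

print(line.bbox, len(list(iter_words(page))))
```

Because only bounding boxes are stored, nothing in this structure depends on the text itself, which is consistent with the language independence noted above.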




Fig. 4. Simple document context. A basic document context contains words (light blue fill), 
lines of text (solid black border), paragraphs (dashed blue border), blocks (dashed green bor- 
der), and images/pictures/charts (solid red border). 



Recognition Architecture 



Given this design, we have implemented an encapsulated software component for
annotation recognition. The component receives strokes and document context as its
input and produces a parse tree with anchors into the document context as its output,
as shown in Fig 5. This abstraction makes it easy to incorporate the recognition component
into different applications. So far, the annotation recognizer has been deployed in the
Callisto plug-in to the web browser Microsoft Internet Explorer [3].



[Figure 5 diagram labels: ink; Writing Layout Analysis and Classification; ink grouping; layout; document context]


Fig. 5. High-level annotations recognition architecture. A first step separates writing and draw- 
ing strokes and groups writing into words, lines, and paragraphs. A second step analyzes ink 
relative to a document context, classifies markup elements, and anchors the annotations to the 
document context. 



The recognition component itself consists of several stages, as shown in Fig 5. Initially,
strokes are run through a component for handwriting layout analysis and classification
that separates writing strokes from drawing strokes and groups writing
strokes into words, lines, and paragraphs, as described in [6]. This stage produces an
initial structural interpretation of the ink without considering the underlying document
context.

Once the strokes have been divided into writing and drawing, a markup detection
stage looks for common annotation markup (horizontal range, vertical range, container,
connector, and symbol) relative to the abstraction of the document context. It
produces a revised structural interpretation of the ink and links the structures to elements
in the document context abstraction. We describe the markup detection in the
following section.



Annotation Recognition 

Markup detection segments and classifies ink into a set of annotation types including 
horizontal range, vertical range, container, and connector. 

One possible approach to markup detection would be to generate all possible com- 
binations of strokes and classify each with respect to the different classes, maximizing 
some utility or likelihood over all hypotheses. This approach suffers from several 
practical problems. First, it is combinatorial: even generic spatial pruning heuristics
may not be enough to make the system run in real-time. Second, it relies on enough 
data to train a reasonable classifier and garbage model. 

Since we wanted to generate an efficient system that could keep up with user anno- 
tation in real-time and did not have large quantities of training data available, we 
opted for a more flexible solution. Our markup detection is implemented as a set of 
detectors. Each detector is responsible for identifying and anchoring a particular an- 
notation type among the ink strokes on the page, and uses a technique specific to its 
annotation type in order to prune the search space over possible groups. When a detector
identifies a candidate for a particular annotation type, it adds the resulting hypothesis
with an associated confidence to a hypothesis map, as shown in Fig 7. For
example, in Fig 7(c), a connector detector hypothesizes that two strokes could be connectors
on their own (both are relatively straight and have plausible anchors at each of
their endpoints), or that they could together form a single connector. We say that a pair
of hypotheses conflict if they share any of the same strokes.
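
The conflict rule above (two hypotheses conflict when they share a stroke) can be sketched as follows. The `Hypothesis` class, the stroke IDs, and the confidence values are hypothetical and serve only to mirror the connector example in the text.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    label: str
    strokes: FrozenSet[int]
    confidence: float


def conflicts(a: Hypothesis, b: Hypothesis) -> bool:
    """Two hypotheses conflict if they claim at least one common stroke."""
    return bool(a.strokes & b.strokes)


def conflicting_pairs(hypotheses: List[Hypothesis]) -> List[Tuple[str, str]]:
    """List the labels of every conflicting pair in the hypothesis map."""
    return [(a.label, b.label)
            for a, b in combinations(hypotheses, 2)
            if conflicts(a, b)]


# Two strokes: each may be a connector alone (X1, X2) or both together (X3).
hypothesis_map = [
    Hypothesis("X1", frozenset({1}), 0.5),
    Hypothesis("X2", frozenset({2}), 0.5),
    Hypothesis("X3", frozenset({1, 2}), 0.8),
]
print(conflicting_pairs(hypothesis_map))
```

In this toy map X1 and X2 can coexist, while X3 excludes both of them; choosing among such alternatives is the job of the later resolution stage.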

Detectors 

Each annotation type has a set of characteristic features that allow it to be distin- 
guished from other annotations and from random strokes on the page. These features 
can be divided into two categories: stroke features and context features. 

Stroke features capture the similarity between a set of ink strokes and an idealized
version of an annotation. For example, the idealized version of an underline is a
straight line, so the stroke features measure the distance between a set of strokes that
might be an underline and the best straight line that approximates those strokes, i.e.
the total regression error on the points in those strokes.
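
A minimal sketch of such a stroke feature for an underline candidate is given here. It assumes an ordinary least-squares fit of y on x and uses the sum of squared residuals as the "total regression error"; the paper does not specify the exact fitting method, so both choices are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def line_fit_error(points: List[Point]) -> float:
    """Sum of squared vertical residuals to the least-squares line y = a*x + b."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0.0:
        # All points share one x value: fall back to the best constant fit.
        return sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


straight = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
bent = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
print(line_fit_error(straight), line_fit_error(bent))
```

A low error only says the ink resembles a straight line; whether it is an underline still depends on the context features described next.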

Context features capture the similarity of the best idealized version of a set of 
strokes and a true annotation on the document context. For example, a stroke might be 
a perfect straight line, but it is not an underline unless that line falls beneath a set of 
words in the document. 

Thus the procedure for each detector is to ascertain a best idealized version of the
strokes according to its type using stroke features, and then see how well that ideal- 
ized version fits with the document context using context features. The stroke and 
context features for each of the annotation types we support are described in Fig 6. 

Fig. 6. Detector features. (a) The original ink annotations on the document. (b) The idealized
annotations overlaid on the ink annotations, and the document context bounding boxes. (c)
Vertical range context features include θ, the angle between the ideal and the lines of text; g,
the gap between the ideal and the lines; the sum of the lengths of the overlapping
portions of the ideal (in red); and the sum of the lengths of the non-overlapping regions
(in blue). (d) Horizontal range context features include θ, the angle between the ideal and
the lines of text; g, the gap between the ideal and the lines; the sum of the lengths of the
overlapping portions of the ideal (in red); and the sum of the lengths of the non-overlapping
regions (in blue). (e) Callout context features include g, the distance of the arrowhead to a
context word along the tangent of the tip of the arrow. (f) Container context features include
the area overlapping with the context words (light blue) and the non-overlapping area with
the context words (pink).
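
One of the context features in the caption above, the overlapping and non-overlapping lengths of an idealized horizontal range against the words of a text line, can be sketched as a one-dimensional interval computation. The function name and the reduction of boxes to horizontal extents are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (left, right) extent along the text line


def overlap_lengths(ideal: Span, word_spans: List[Span]) -> Tuple[float, float]:
    """Return (overlapping length, non-overlapping length) of the ideal stroke."""
    start, end = ideal
    # Clip every word to the extent of the ideal and drop words outside it.
    clipped = sorted((max(a, start), min(b, end))
                     for a, b in word_spans
                     if min(b, end) > max(a, start))
    covered = 0.0
    cur_start, cur_end = None, None
    for a, b in clipped:
        if cur_end is None or a > cur_end:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = a, b
        else:
            # Touches or overlaps the current run: extend it.
            cur_end = max(cur_end, b)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered, (end - start) - covered


print(overlap_lengths((0.0, 10.0), [(1.0, 4.0), (3.0, 6.0), (8.0, 12.0)]))
```

Merging the clipped spans before summing avoids counting twice the part of the ideal that lies under two overlapping word boxes.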



Resolution 

Once all of the detectors have executed, the most likely annotations are extracted 
from the map through a resolution process and the result is committed to the output, 
as shown in Fig 7(d). The resolution is designed to pick the best candidates when 
there are conflicting hypotheses. It is a unifying framework by which detectors can be 
added modularly to support new annotation types. 

Resolution is designed to maximize the number of explained strokes, maximize the
overall confidence, and minimize the number of hypotheses. This can be expressed as
the maximization of an energy function:

$$E = \sum_i \mathrm{confidence}_i + \alpha\,|\text{explained strokes}| - \beta\,|\text{hypotheses}| \qquad (1)$$

In Equation 1, α and β are empirically determined weights. We maximize this
function exactly using dynamic programming. Since there is no special ordering of
the strokes, we can impose one arbitrarily and solve using the following recurrence
relation:



$$E(S) = \begin{cases} 0 & \text{if } S \text{ is empty} \\ \max_{S'} \bigl( C(S') + E(S - S') - \beta \bigr) & \text{otherwise} \end{cases} \qquad (2)$$

In Equation 2, S represents a subset of strokes on the page, S′ is a hypothesis containing
the stroke in S with minimum ID, or no explanation for that stroke, and C is
the confidence of that explanation plus α times the number of strokes it explains, or 0 if
the minimum stroke is left unexplained.
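
The recurrence can be sketched with memoized recursion over sets of remaining strokes. This is a hedged illustration rather than the authors' implementation: the `Hypothesis` class, the weights, and the example values are hypothetical, and the sketch assumes that the β penalty applies only when a hypothesis is actually selected, so that leaving a stroke unexplained contributes nothing, in line with Equation 1.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Hypothesis:
    label: str
    strokes: FrozenSet[int]
    confidence: float


def resolve(stroke_ids: Iterable[int],
            hypotheses: Iterable[Hypothesis],
            alpha: float = 1.0,
            beta: float = 0.5) -> Tuple[float, Tuple[Hypothesis, ...]]:
    """Return (best energy, selected non-conflicting hypotheses)."""
    hyps = tuple(hypotheses)

    @lru_cache(maxsize=None)
    def best(remaining: FrozenSet[int]) -> Tuple[float, Tuple[Hypothesis, ...]]:
        if not remaining:
            return 0.0, ()
        first = min(remaining)  # arbitrary but fixed stroke ordering
        # Option 1: leave the minimum stroke unexplained.
        energy, chosen = best(remaining - {first})
        # Option 2: explain it with a hypothesis that fits in the remaining set.
        for h in hyps:
            if first in h.strokes and h.strokes <= remaining:
                sub_energy, sub_chosen = best(remaining - h.strokes)
                candidate = h.confidence + alpha * len(h.strokes) - beta + sub_energy
                if candidate > energy:
                    energy, chosen = candidate, (h,) + sub_chosen
        return energy, chosen

    return best(frozenset(stroke_ids))


hyps = [
    Hypothesis("X1", frozenset({0}), 0.5),
    Hypothesis("X2", frozenset({1}), 0.5),
    Hypothesis("X3", frozenset({0, 1}), 0.8),
]
energy, selected = resolve({0, 1}, hyps)
print(round(energy, 2), [h.label for h in selected])
```

Requiring each selected hypothesis to be a subset of the remaining strokes guarantees that the result never contains two conflicting hypotheses, and the cache keeps repeated sub-problems from being solved twice.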




[Figure 7 panels: (a) Initial map; (b) After connector detection; (c) After all detectors; (d) After resolution]

Fig. 7. Hypothesis framework. (a) Initially the map is empty. (b) Connector detection inputs
three conflicting hypotheses (X1, X2, X3). (c) The rest of the detectors execute, adding
container (C), horizontal range (H), vertical range (V), and margin notes (N) to the map.
(d) Resolution selects the most likely hypotheses (C, X2, and N).

Evaluation



Our evaluation goals were two-fold. First, we wanted to understand the accuracy of 
the complete system. Second, we wanted to understand the effectiveness of the reso- 
lution process. We thus measured the accuracy of each of the detectors, and compared 
those numbers with the final system accuracy. 

Our test set consisted of ~100 heavily annotated web pages containing 229 underlines,
250 strikethroughs, 422 containers, 255 callouts and 36 vertical ranges. To
simplify accounting, we unify grouping errors and labeling errors into one unit. In
other words, an annotation is correct if it is grouped and labeled properly; otherwise it
results in a false negative and possibly multiple false positives.



Table 1. Results from running the individual detectors prior to resolution.

                 Correct   False positive   False negative
Underline          219          183               10
Strikethrough      244           99                6
Blob               400            6               22
Callout            206          529               49
Margin bar          35          219                1



Table 2. System results after resolution including percentage changes from the data in Table 1.
Percentages are obtained by

                 Correct         False positive   False negative
Underline        206 (-5.7%)     24 (-69.4%)      16 (+2.6%)
Strikethrough    229 (-6%)       35 (-25.6%)       9 (+1.2%)
Blob             396 (-0.9%)      6 (0%)          25 (+0.7%)
Callout          177 (-11.3%)    31 (-195%)       77 (+11%)
Margin bar        35 (0%)       140 (-225%)        1 (0%)
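
The caption's explanation of the percentages is cut off in the source. As a hedged check, most of the printed values appear consistent with dividing the change in each count by the number of ground-truth annotations of that type, assuming that "Blob" corresponds to the 422 containers in the test set. The Margin bar row is left out of the sketch because its false positive percentage does not reproduce under this assumption.

```python
# Ground-truth totals from the test set description (Blob assumed = containers).
truth = {"Underline": 229, "Strikethrough": 250, "Blob": 422, "Callout": 255}

# (correct, false positive, false negative) before and after resolution.
before = {
    "Underline": (219, 183, 10),
    "Strikethrough": (244, 99, 6),
    "Blob": (400, 6, 22),
    "Callout": (206, 529, 49),
}
after = {
    "Underline": (206, 24, 16),
    "Strikethrough": (229, 35, 9),
    "Blob": (396, 6, 25),
    "Callout": (177, 31, 77),
}


def pct_change(before_count: int, after_count: int, total: int) -> float:
    """Change in a count, expressed as a percentage of the ground-truth total."""
    return 100.0 * (after_count - before_count) / total


for name, total in truth.items():
    changes = [round(pct_change(b, a, total), 1)
               for b, a in zip(before[name], after[name])]
    print(name, changes)
```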



These results show that the system has reasonably high accuracy despite the inher- 
ent ambiguity in the problem, our small quantities of training data, and the compro- 
mises we have made in choosing our techniques such that the system could operate in 
real-time. Pre-resolution our detectors performed adequately for most classes except 
for callout - we have not been able to identify good features for callouts since most 
strokes on the page have objects at one endpoint or the other (callouts with arrow- 
heads are substantially easier to detect). We hope to learn useful features once we 
have collected a larger data set. The results further show that resolution significantly 
decreases the number of false positives without substantial change to the false nega- 
tives. This indicates that it is a reasonable strategy for this problem. 



Conclusions and Future Work 

We have presented an approach to recognizing freeform digital ink annotations on 
electronic documents, along with a practical implementation. The resulting recognizer 
facilitates all of the operations common to traditional digital annotations, but through 
the natural and transparent medium of digital ink. 








Rather than constraining the user, we employ an extensible framework for annota- 
tion recognition which achieves high accuracy even for complex documents. Our 
work approximates an exhaustive search of possible segmentations and classifica- 
tions. This makes it possible to analyze a full page of ink in real-time, and can be 
applied to many other ink recognition problems. 

We have implemented our approach in a reusable software component, have inte- 
grated the component into a full system for annotating web pages within Microsoft 
Internet Explorer, and have evaluated its accuracy over a collection of annotated web 
pages. 

However, there are many ways we hope to extend this work both from an analysis 
standpoint and from a system standpoint. From an analysis standpoint, we have made 
numerous compromises both for efficiency and due to sparse amounts of labeled data. 
We are currently collecting and labeling annotation data and hope to explore fast
data-driven alternatives to our current heuristics for detection and resolution.

From a system standpoint, we have yet to corroborate our intuitions with user stud- 
ies of the full system. In addition, many of the structures that we recognize, such as 
boxes and connectors, are also common to other types of sketching such as flow 
charts and engineering diagrams. Our efficient inference algorithm should also extend 
to these domains. Furthermore, it should be possible for users to customize the system 
with their own annotation styles if they are not supported by our basic set. Finally, we 
are interested in examining other creative ways in which annotations can be used once 
they are robustly recognized. 

Abstract. The computer transcription of handwritten Pitman’s shorthand has
enormous potential as a means of rapid text entry to today’s handheld devices.
Recognition errors caused in pattern segmentation and classification raise the
incidence of ambiguous interpretation in existing systems, and the paper proposes
a well-established unigram technique and an efficient heuristic method to
reduce ambiguity in a linguistic post processor. The heuristics applied in our
transcription system are: firstly, incorporating visual stimulus as used by human
readers; secondly, applying knowledge of the most common words of Pitman
shorthand; and finally, adding knowledge of collocation. An experiment using a
phonetic lexicon of 5000 entries shows the distribution of ambiguity in a
shorthand lexicon due to the similarity of outlines and estimates a transcription
accuracy of 94%.



1 Introduction 

Pitman shorthand is a speech-recording medium practiced at a practical rate of about 
120-180 words per minute. It records speech phonetically as shown in Figure 1, and 
transliteration of phoneme primitives into orthographic target script relates to tech- 
niques applied to speech recognition systems such as Hidden Markov Model (HMM) 
and statistical language modeling techniques, which are used to convert a given utter- 
ance to the most likely word. 



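[Figure 1: table with columns Consonant, Input outlines, and Recognised outlines, covering the consonants T, D, P and B; the shorthand strokes are not reproducible in text. Example interpretations shown: "bat, pat, bad, pad" and "bet, pet, bed".]
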

Fig. 1. Samples of Pitman shorthand notation and an illustration of ambiguous interpretations 
for a single input outline. 

The potential of Pitman shorthand [1] as a means of rapid pen-driven text entry to
a computer was first reported in the 1980s. Later works by Qiao & Leedham [2][3]
used statistical phonetic models to validate all legal pairs of primitive combinations
and reduce ambiguous candidates in the final text interpretation. The most recent
work by Nagabhushan et al. [4] introduced a shorthand-specific dictionary-lookup
method and concluded that further work is required in the area of resolving homophones
(outlines which are written similarly, but have different interpretations). This is a
critical bottleneck in the recognition of human handwritten shorthand and a novel
approach of homophone deduction by the use of unigram and contextual knowledge is
proposed in this paper.



2 Unigram Approach 



In general, homophones are caused by two factors which start in the recognition
stage: firstly, the inability to differentiate between thick and thin pen strokes and,
secondly, the inability to detect the exact location of a vowel notation against a consonant
kernel. These cause a script to be ambiguous and raise the occurrence of homophones.

The “First and Last Segment Algorithm (F&L)” is a unigram approach used to filter
ambiguous outlines in our system. The motive behind this approach is that human readers
rely on the legibility of the first and last segments of an outline in real text transcription.
This finding is based on the transcription of 10 samples of shorthand notes by
three professional Pitman writers, where 55% of their transcription is done by the
legibility of the first and last pattern primitives. The algorithm can be defined as



$$P(x_i) = \frac{P(f_i)\,P(l_i)}{\sum_{j=1}^{n} P(f_j)\,P(l_j)} \qquad (1)$$

where P(x_i) is the normalized probability of the first and last segment of the i-th candidate,
and (f_1, f_2, ..., f_n) and (l_1, l_2, ..., l_n) denote the first and last pen strokes of the n
ambiguous candidate outlines.
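
A minimal sketch of this normalization is given here. The function name and the example probabilities are hypothetical; the only thing taken from the text is that each candidate's score is the product of its first and last segment probabilities, divided by the sum of such products over all n candidates.

```python
from typing import List, Tuple


def first_last_scores(candidates: List[Tuple[float, float]]) -> List[float]:
    """Normalize P(first) * P(last) over all ambiguous candidate outlines."""
    products = [p_first * p_last for p_first, p_last in candidates]
    total = sum(products)
    if total == 0.0:
        # No candidate has any support; avoid dividing by zero.
        return [0.0] * len(products)
    return [p / total for p in products]


# Two hypothetical candidates given as (P(first segment), P(last segment)).
scores = first_last_scores([(0.9, 0.8), (0.6, 0.4)])
print([round(s, 2) for s in scores])
```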

Another way to filter ambiguous outlines is the use of the “Anchor Node (AN)
Algorithm”, in which the segment with the highest accuracy is considered as an anchor
node and the search is based on local variables, i.e., descendant primitives of the anchor
segment. The algorithm can be defined as

$$P(y) = \max\bigl(P(a_1), P(a_2), \ldots, P(a_n)\bigr) \qquad (2)$$

where P(y) is the probability of an anchor segment and (a_1, a_2, ..., a_n) denotes the
primitives of a candidate outline.



3 Heuristic Approach 

“Word Frequency (WF) Algorithm” is a word level filter used to select the most prob- 
able word from a potential list of homophones. In order to cover a general domain, 
our transcription system uses the first 5000 of the most frequently used English words 
and each word is tagged with its corresponding frequency value. If an isolated outline 
is related to a list of homophones, a candidate with the highest frequency value is 
chosen as a potential successor. The algorithm can be defined as 
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/=l 



(3) 



where P(f-) is the normalized probability of the frequency of a word in general Eng- 
lish and {hj, h 2 ,.., hj denotes homophones of a single outline. 
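
The word-level filter can be sketched as follows. Since the defining equation is not legible in the source, the normalization used here (each homophone's frequency divided by the sum of the frequencies of all homophones of the outline) is an assumption, and the frequency values are made up for illustration.

```python
from typing import Dict


def word_frequency_scores(homophones: Dict[str, float]) -> Dict[str, float]:
    """Normalize the frequency values over the homophones of one outline."""
    total = sum(homophones.values())
    if total == 0:
        return {word: 0.0 for word in homophones}
    return {word: freq / total for word, freq in homophones.items()}


def most_frequent(homophones: Dict[str, float]) -> str:
    """Pick the homophone with the highest frequency value."""
    return max(homophones, key=homophones.get)


# Hypothetical frequency tags for the homophones of a single outline.
candidates = {"bat": 20, "pat": 10, "bad": 60, "pad": 10}
wf_scores = word_frequency_scores(candidates)
print(most_frequent(candidates), wf_scores["bad"])
```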

To clarify the “F&L”, “AN” and “WF” algorithms, consider the simple example
shown in Figure 2. As shown in the example, the final word selection is done by the
maximum likelihood probability of P(WF) ∩ P(F&L) ∩ P(AN).

Fig. 2. Illustration of transcription based on “F&L”, “AN” and “WF” algorithms.



4 Implementation and Experimental Results 

The current goal of our experiment is to analyse the increase in similar outlines with
the growth of a shorthand lexicon so that we can estimate which vocabulary level
contains the most similar outlines (i.e., homophones) and which level has the least.
The input to our system is a lexicon of up to the 5000 most frequently used English words,
containing Pitman shorthand as indexes.

Experiment One in Figure 3 simulates perfect segmentation and recognition of an
outline and demonstrates how a given outline can be uniquely identified in different
sizes of shorthand lexicon. The experimental results show that about 94% of the 5000
most frequently used English words have unique shorthand notations. The maximum
ambiguity is 5 potential words per outline and the average ambiguity is 3 potential
words per outline for the 6% without unique notations. Therefore a transcription accuracy
of at least 94% can be estimated if there is no error in the prior outline segmentation
and classification.

Experiment Two in Figure 3 simulates recognition under line thickness ambiguity.
This is the most likely case in real-time recognition as most digitisers cannot detect
line thickness, whereas Pitman defines similar sounding consonants by the same
strokes and differentiates between voiced and unvoiced sounds by thick and thin lines.
The experimental result shows that the uniqueness of outlines drops by about 4% given line
thickness ambiguity and, therefore, the transcription accuracy at best can be expected
to be 90% in the actual recognition of Pitman shorthand.
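
The kind of measurement behind Experiments One and Two can be sketched as counting how many words keep a unique outline, first as written and then after collapsing thick and thin strokes. The string encoding of outlines and the toy lexicon are hypothetical; a real lexicon would index the 5000 words by their Pitman primitives.

```python
from collections import Counter
from typing import Callable, Dict, Optional


def unique_outline_rate(lexicon: Dict[str, str],
                        collapse: Optional[Callable[[str], str]] = None) -> float:
    """Fraction of words whose (optionally collapsed) outline is unique."""
    keys = [collapse(outline) if collapse else outline
            for outline in lexicon.values()]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Thick strokes (voiced B, D) become indistinguishable from thin ones (P, T).
THICK_TO_THIN = str.maketrans("BD", "PT")


def ignore_thickness(outline: str) -> str:
    return outline.translate(THICK_TO_THIN)


# Hypothetical encoding: upper-case consonant strokes, lower-case vowel marks.
toy_lexicon = {
    "bat": "B-a-T",
    "pat": "P-a-T",
    "bed": "B-e-D",
    "pet": "P-e-T",
    "hat": "H-a-T",
}
print(unique_outline_rate(toy_lexicon))
print(unique_outline_rate(toy_lexicon, ignore_thickness))
```

Passing a different `collapse` function, for example one that strips the vowel marks, would model the vowel ambiguity case in the same way.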

Experiment Three in Figure 3 simulates recognition under vowel ambiguity. This
is an essential consideration in the recognition of Pitman shorthand, as it is common
practice for stenographers to omit vowels in an outline, and the omitted positions are
unpredictable as they vary widely with the writer’s experience or individual inclination. If
the solution to the unpredictable omission of vowels in an outline is to exclude
vowel notations from the shorthand lexicon and match segmented primitives
without vowel components, the new version of the lexicon is expected to have about
71% unique outlines. A strange peak, i.e., 85% unique outlines, occurring around a
lexicon size of 4000 shows that outlines are more easily distinguishable without
vowel components at a vocabulary level of 4000 words, but 4000 words is not enough
for a general vocabulary.



[Figure 3 legend: 1. Uniqueness of outlines for perfect recognition; 2. Uniqueness of outlines given line thickness ambiguity; 3. Uniqueness of outlines given vowel ambiguity]


Fig. 3. Analysis on shorthand Lexicon with the 5000 most frequently used words. 
6 Discussion

Current findings are based on a shorthand dictionary, and further experiments need to
be done on real handwritten data. Algorithms mentioned in the paper such as “F&L”,
“AN” and “WF” need to be put into practice, and further evaluation is needed on
complete word-level and sentence-level transcription.
Abstract. In this paper, we propose a method for handwritten text characterization
based on a multiscale and multiresolution drawing analysis. The approach relies on
the definition of four complementary visual dimensions of handwritten text: the macro
and micro orientation (obtained with a multiscale frequency analysis of the image),
the text linearity (defined by the merging of connected components), the curvature
(measured as a multiresolution deformation of the high profile) and the complexity
(expressed as the entropy of the multiscale drawing distribution). Each feature is
studied in an evolution graph that can be expressed as a unique handwritten curve
signature. This leads to a description in terms of separable writer families with
individual visual characteristics. The results are very promising.



1 Introduction 

Handwritten documents of 18th and 19th century authors like Montesquieu or Flaubert
constitute rare collections that are preserved in libraries or specialized institutes and
have recently been associated with innovative digitization projects. Among these
manuscript collections, we have been interested in an impressive, rare and complete
collection of French manuscripts written by Montesquieu, a well-known author of the
18th century. The Montesquieu handwritten collection has been marked by intensive use
and handling, which results in poor visual quality of the documents, with handwritten
texts that are often erased and carry multi-writer annotations and corrections, see
figure 1.




Fig. 1. Example of Montesquieu’s autograph extracted from “De l’Esprit des Lois” (1789).

Montesquieu manuscripts are characterized by a great diversity of writers: they
were produced with the help of more than twenty secretaries, each with visually
distinctive writing characteristics. Our purpose here is to show that it is possible to
identify different writers belonging to well identified writer families by considering
macroscopic visual characteristics, with an analysis of writing forms based on the study
of multiscale visual characteristics: the curve orientation, the drawing linearity, the
multiresolution curvature and the global complexity as a measure of drawing density.
In this study, we try to evaluate the variability of writers within a same page and
a same book. To do so, we propose a complete methodology for classifying handwriting
into visually similar clusters. In the field of handwritten text recognition, the
recognition task is often considered as a pure omni-writer problem, in which letters and
words can be processed independently and sequentially. We have chosen instead to
consider the emergent visual properties of handwritten text, in the manner of human
visual expertise, like Nosary in [5]. The originality of our approach is to follow a
multiscale analysis from the macroscopic level (linked to the page dimension) to the
microscopic level (linked to the grapheme or letter dimension). The macroscopic analysis
relies on texture properties of line drawings, as introduced by Kuckuck in [3] and
developed later by Said in [7] with multi-spectral decomposition of text images.



2 Our Proposition: A Multiscale Handwriting Approach 

In this paper, we propose a multiscale approach to characterize and classify
handwritten text based on four complementary dimensions:

- The orientation is evaluated with a local frequency analysis based on a hierarchical
decomposition of the image area. With this approach, pages can be seen at
different scales and resolutions, giving access to different levels of information.
Gabor filters have been widely used by Jain in [2] to detect specific oriented
information in a document. In this work we use a very similar approach, based on the
Hermite transform. Hermite filters decompose a signal localized by a Gaussian
window into a set of orthogonal Hermite polynomials [4].

Hermite filters are separable both in spatial and polar coordinates, so they can be
implemented very efficiently. In the frequency domain, these filters are Gaussian-like
band-pass filters, and filters of increasing order analyze successively higher
frequencies in the signal. They allow a local analysis which is very close to the
Gabor one [6]. Figure 2.a presents the Hermite decomposition of the document of
figure 1 at a given scale and up to degree 2 (lowest frequencies), as well as the
image reconstructed from this decomposition and the difference Ip with the
original image. We use image Ip (highest frequencies) to localize the text on the
page and a combination of information extracted from the four quadrants to
characterize the text.

- The linearity is defined from the evolution of connected components in text lines
(see figure 3). The initial image is filtered through a bank of directional filters (in
the direction of the main orientation detected with an autocorrelation analysis, see [1])
and is processed with a Difference of Offset Gaussian method (DooG). This approach
is well adapted to drawing segmentation and does not need a priori knowledge
of the page content. The processed image reveals high frequencies in the contours
of connected line drawings. We then compute the average number of
connected components in the filtered image in the main direction of the drawing.
The linearity reveals the size of the connections that can be found between
letters and words in a same text line. This dimension underlines visually salient
differences between inter-word spaced and continuous handwritten drawings. The
evolution of the linearity value through decreasing resolutions expresses the
connection rate between handwritten components of the analyzed block (from
graphemes to words).

Fig. 2. Example of Hermite decomposition: (a) Hermite decomposition of the document of
figure 1 at a given scale and up to degree 2, (b) a part of the reconstructed image, using
the decomposition of (a), (c) difference with the original image.

Fig. 3. Evolution of text representation with decreasing resolution and increasing line fusion.

- The cursivity is evaluated as a word profile deformation. This dimension
expresses the ratio between the high and low word profiles (line contours),
as illustrated in figure 4. The DooG approach is applied recursively to
extract form contours that are not necessarily limited to words: at the first, high
resolutions of the drawing block, the connected contours can enclose simple
graphemes, whereas at low resolutions (after processing with a bank of recursive
convolutions), connected contours can describe a large part of a text line including
one or several words. The resulting multiresolution description of contours (with
a high and a low profile) allows the mean curvature ratio to be evaluated at each
considered resolution.

- The complexity is expressed as an entropy of curve distributions and expresses the
drawing density. The entropy value is proportional to the grapheme size. Its
estimation is based on the number of intersections between handwritten drawing lines
and oriented lines. We estimate a probability of intersection between the text and
the lines, and we compute for each text block the complexity measure
$E(T) = p \log(1/p) + (1-p) \log(1/(1-p))$, where p is the maximal probability of
intersection. The smaller the handwriting, the higher the complexity. The complexity is
estimated on handwritten text blocks perceived at different resolutions.
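
The complexity measure above is the binary entropy of p, and can be sketched as
follows. This is a minimal illustration, not the authors' implementation: the way p
is obtained here (the largest fraction of probe lines that intersect the text over a
set of orientations) is an assumption, as are the names `complexity` and
`block_complexity`. The natural logarithm is used, since the paper does not state the base.

```python
import math

def complexity(p):
    """Binary entropy E = p*log(1/p) + (1-p)*log(1/(1-p)), natural logarithm."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # the term q*log(1/q) tends to 0 when q tends to 0
            total += q * math.log(1.0 / q)
    return total

def block_complexity(hits_per_orientation, lines_per_orientation):
    """Assumed estimate: p is the largest intersection ratio over all orientations."""
    p = max(hits / lines_per_orientation for hits in hits_per_orientation)
    return complexity(p)

# Hypothetical counts: 100 probe lines per orientation, three orientations.
example = block_complexity([10, 50, 30], 100)
```

The measure is symmetric in p and 1-p and reaches its maximum, log 2, at p = 0.5.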




Fig. 4. High and low profiles evolution for three resolutions. 



Each measure (complexity, cursivity and linearity) is then studied through a
multiscale evolution graph that simulates the perceptive behavior of a human
observing a drawing from close to far. Each feature is iteratively recomputed on
images of successively decreasing resolution and reported in a 2D graph. For each graph
(each dimension corresponding to a particular drawing) we estimate a single factor
that expresses a unique signature of handwritten drawing lines. For the complexity,
we compute the couple (E_min, E_max) corresponding to the minimal and maximal
entropy values obtained across resolutions; for the linearity, we estimate two
derivatives on the decreasing curve of measures; for the cursivity, we compute
a multiresolution deformation from the evolution of the high profile lengths across
resolutions. The resulting 2D graphs obtained for each feature lead to the formation
of visual clusters that are considered as handwriting families. Figure 5 illustrates
the complexity evolution for a set of relevant handwritten text extracts.




Fig. 5. Multiscale description of handwriting complexity.

The final categorization of handwritten text blocks relies on the combination of all
individual dimensions, analysed through a k-means based classification. The orientation
is processed separately. Currently, we have chosen to consider 5 separable writer
families for each feature. With this approach, we are able to create 38 different
families among the 50 initially considered, some of which have strong visual
similarities. This approach is a robust way to sort complex, compact, spaced, large or,
on the contrary, thin drawings into visually significant clusters.
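
As an illustration of the per-feature grouping, the following is a minimal sketch of
a one-dimensional k-means on a single feature value per text block. It is not the
authors' classifier: the deterministic initialization on evenly spaced order
statistics, the function names and the toy values are all assumptions, and the
example uses k = 3 rather than the 5 families mentioned above.

```python
def assign(values, centers):
    """Index of the nearest center for each value."""
    return [min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values]

def kmeans_1d(values, k, max_iter=100):
    """Plain k-means on scalars, with a deterministic initialization."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: k evenly spaced order statistics of the data.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        labels = assign(values, centers)
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous center.
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return assign(values, centers), centers

# Hypothetical complexity values for six text blocks.
labels, centers = kmeans_1d([0.10, 0.12, 0.50, 0.52, 0.90, 0.88], k=3)
```

Running the same procedure independently on each feature and combining the resulting
labels gives one possible way of forming composite families.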



3 Conclusion and Prospective Works 

This work is part of a writer identification project and is applied to Montesquieu’s
handwritten corpus. We have chosen to analyze the pages as a whole (with a
macroscopic multi-dimensional approach to handwriting characterization, based on
four complementary features: orientation, complexity, curvature and linearity),
and to refine the description of handwritten drawings with a multiscale approach.
This work is currently being completed with a fine-grained writer analysis that
precisely characterizes each writer’s internal stability and the relevant differences
between writers.
Abstract. We propose to combine a feature descriptor method with a
structural representation of symbols. An adaptation of the Radon transform
is provided that preserves its properties with respect to the main geometric
transformations usually required for the recognition of symbols. In order to
improve the recognition step we process the grey level document directly.
To this end, a three-dimensional signature integrates into a single formalism
both the shape of the object and its photometric variations. More precisely,
the signature is computed within the symbol over several grey levels.
Additionally, a structural representation of symbols allows candidate symbols
to be localized in the document.



1 Introduction 

In recent years, the rapid development of the Internet and the large increase in
storage capacity have given rise to strong interest in the symbol recognition
problem [1]. Usually documents are scanned in grey levels and a binarization
step is then carried out to extract symbols from the background.

Many approaches have been proposed in recent years for recognizing symbols.
Some of them are based on feature descriptors [2-5]. Two kinds of descriptors
are often encountered: those that work on an object as a whole and those that
work on the contours of the objects. In general, approaches based on feature
descriptors are robust against noise and occlusions: region-based descriptors
are less sensitive to noise and contour-based descriptors to occlusions.
Nevertheless, an important drawback of all these methods is that complete
symbols in a document must be clearly segmented, which is in itself an
ill-posed problem. Typically, symbols are almost always embedded in other
graphics layers. Moreover they can touch or partially overlap each other, or
intersect other lines.

Other approaches focus on structural shape representation [6-10].
Among data structures, graphs are usually suitable to support the structural
representation of symbols. More generally, a symbol is decomposed into a set of
segments, and the segments and the spatial relations between them are
represented by an attributed relational graph. The computational complexity of
graph matching techniques is rather high, even though efficient algorithms have
been proposed for some classes of graphs. Moreover these approaches suffer from
the noise introduced by the segmentation process, even if statistical assumptions
can be made on that noise [8]. However, structural approaches are powerful in
terms of their representational capabilities for spotting target symbols.

In this paper, we propose to combine a feature descriptor method with a
structural representation of symbols. Documents are often perturbed by noise
(digitization, document quality...), and a binarization step gives rise to artifacts,
or some parts of the symbols can be removed depending on the chosen threshold.
In order to improve the recognition process we work directly on the grey level
document. We make here an original adaptation of the Radon transform that
preserves its properties with respect to the main geometric transformations
usually required for the recognition of symbols. A three-dimensional signature
integrates into a single formalism both the shape of the object and its
photometric variations. More precisely, the signature is computed within the
symbol over several grey levels, so noise effects on the boundary of the
processed object are insignificant with respect to its whole shape.
Additionally, we compute on the document the skeleton obtained from the medial
axis [11]. Key points (junction points) on the skeleton are organized into a graph
whose edges describe the links between junction points. From this
representation, candidate symbols are selected and validated with the signature
descriptor.

The paper is organized as follows. In section 2 we recall the definition of
the Radon transform and its extension to recognize grey level symbols. Then, we
describe our symbol localization process using a structural representation based
on a decomposition of the document into its skeleton (section 3). Examples and
experimental results are given in section 4 and future investigations are outlined
in section 5.



2 Signature of Grey Level Symbols 

The Radon transform $T_R f$ of a function $f$ is defined as follows [12]:

$$T_R f(\rho,\theta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,\delta(x\cos\theta + y\sin\theta - \rho)\,dx\,dy . \tag{1}$$

where $-\infty < \rho < \infty$, $0 \le \theta < \pi$ and $\delta$ is the Dirac function.
Intuitively, this amounts to integrating the function $f$ along a line for any
parameters $(\rho, \theta)$.

We use here a shape measure, called R-signature [13], which is defined from
the Radon transform:

$$\mathcal{R}_f(\theta) = \int_{-\infty}^{\infty} T_R^2 f(\rho,\theta)\,d\rho . \tag{2}$$
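
As an illustration of equations (1) and (2), the following is a minimal sketch of a
discrete R-signature. It assumes that the signature integrates the squared Radon
transform over ρ (without the square, the integral would simply be the total mass of
f for every θ), that the image is a list of rows of non-negative values, and that
projections are accumulated in unit-width ρ bins by nearest-integer rounding. The
names `r_signature` and `n_angles` are illustrative and not taken from the paper.

```python
import math

def r_signature(image, n_angles=180):
    """Discrete R-signature: for each angle, the sum of squared Radon projections."""
    h, w = len(image), len(image[0])
    # |rho| never exceeds the image diagonal, so this offset keeps bins non-negative.
    offset = int(math.ceil(math.hypot(w, h)))
    signature = []
    for k in range(n_angles):
        theta = math.pi * k / n_angles  # theta sampled in [0, pi)
        c, s = math.cos(theta), math.sin(theta)
        acc = [0.0] * (2 * offset + 1)  # accumulator over rho for this angle
        for y, row in enumerate(image):
            for x, value in enumerate(row):
                if value:
                    rho = x * c + y * s
                    acc[int(math.floor(rho + 0.5)) + offset] += value
        signature.append(sum(a * a for a in acc))
    return signature

# A small binary shape and the same shape shifted inside a larger image.
shape = [[0, 1, 1],
         [0, 1, 0]]
shifted = [[0, 0, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 0]]
sig = r_signature(shape, n_angles=4)
sig_shifted = r_signature(shifted, n_angles=4)
```

With this discretization the value at θ = 0 is the sum of squared column sums, which
does not change when the shape is shifted; at other angles the invariance holds only
up to binning error.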

2.1 R-Signature Properties

It is easy to show, from basic Radon properties, that the following properties
hold for the R-signature:

— Periodicity of $\pi$:

$$\int_{-\infty}^{\infty} T_R^2 f(\rho,\theta \pm \pi)\,d\rho = \mathcal{R}_f(\theta) . \tag{3}$$

— Rotation: a rotation of angle $\theta_0$ applied to $f$ implies a cyclic
shift of the R-signature depending on $\theta_0$:

$$\int_{-\infty}^{\infty} T_R^2 f(\rho,\theta + \theta_0)\,d\rho = \mathcal{R}_f(\theta + \theta_0) . \tag{4}$$

— Translation: an R-signature is invariant to a translation of $f$ by $\vec{u} = (x_0, y_0)$:

$$\int_{-\infty}^{\infty} T_R^2 f(\rho - x_0\cos\theta - y_0\sin\theta, \theta)\,d\rho = \mathcal{R}_f(\theta) . \tag{5}$$

— Scale: a zoom of $\alpha \neq 0$ applied to $f$ implies that:

$$\int_{-\infty}^{\infty} T_R^2 f(\alpha\rho,\theta)\,d\rho = \frac{1}{\alpha}\,\mathcal{R}_f(\theta), \quad \alpha > 0 . \tag{6}$$

2.2 R3D-Signature Definition

Generally a two-dimensional representation of the signature is sufficient to reach
accurate results when only the shape of the object is considered. Nevertheless such
an approach is not suitable for recognizing grey level symbols. There are two ways
to deal with photometry in the definition of the signature. First,
the luminance can be directly integrated along the scan line. In this case we
have observed a smoothing of the 2D signature, which makes it less discriminant.
Another idea is to consider the image as a set of level cuts obtained from
successive binarizations (one for each grey level contained in the image) and so to
define a 3D signature. A simple approach to compute an R3D-signature is
to calculate the signature associated with each cut and to append all of them.
Let $O$ be an object composed of $\#ng$ grey levels and let $X_i$ be the binary cut of
threshold $i$ applied to $O$. We have $O = \bigcup_{i=1}^{\#ng} X_i$ with
$X_{\#ng} \subset X_{\#ng-1} \subset \dots \subset X_1$. So,
$\mathcal{R}_{3D} = \bigcup \{\mathcal{R}_{X_i}\}_{i=1,\#ng}$, with $\mathcal{R}_{X_i}$ the
signature calculated on the binary image $X_i$ obtained from a binarization of
threshold $i$, and $\#ng$ the number of grey levels of the image. However, such
successive binarizations each require the computation of a Radon transform and would
be expensive in processing time. The use of an accumulator in the definition of the
Radon transform makes it possible to take the grey levels of the image into account
directly. A point with a given value appears in all the high level cuts and therefore
in each associated signature. It is enough to integrate this information during the
calculation of the Radon transform of this point. Then only one Radon transform is
required to define an R3D-signature. Figure 1 presents the signature obtained on a
grey level symbol.
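
The accumulator idea can be sketched as follows: each pixel is projected once per
angle, and its contribution is added to the accumulators of every cut that contains
it. This is only an illustrative sketch under stated assumptions: cut i is taken to
contain the pixels whose grey level is at least i, grey levels are integers from 0 to
`n_levels`, the per-cut signature integrates the squared projection, and the name
`r3d_signature` and its parameters are hypothetical.

```python
import math

def r3d_signature(image, n_levels, n_angles=180):
    """sig[i - 1][k]: R-signature at angle k of the binary cut {pixels >= i}."""
    h, w = len(image), len(image[0])
    offset = int(math.ceil(math.hypot(w, h)))
    sig = [[0.0] * n_angles for _ in range(n_levels)]
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        c, s = math.cos(theta), math.sin(theta)
        # One rho accumulator per cut, all filled in a single pass over the pixels.
        acc = [[0] * (2 * offset + 1) for _ in range(n_levels)]
        for y, row in enumerate(image):
            for x, value in enumerate(row):
                rho_bin = int(math.floor(x * c + y * s + 0.5)) + offset
                # Assumption: a pixel of level v belongs to the cuts 1..v.
                for i in range(min(value, n_levels)):
                    acc[i][rho_bin] += 1
        for i in range(n_levels):
            sig[i][k] = float(sum(a * a for a in acc[i]))
    return sig

# Toy image with grey levels 0, 1 and 2.
grey = [[0, 1, 2],
        [0, 2, 0]]
sig3d = r3d_signature(grey, n_levels=2, n_angles=4)
```

Here `sig3d[0]` is the signature of the cut keeping all non-zero pixels and `sig3d[1]`
that of the cut keeping only the pixels of level 2.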
Fig. 1. Example of R-signature.

2.3 Properties of R3D-Signatures

By definition an R3D-signature consists of a set of R-signatures that separately
satisfy the properties described previously (see section 2.1). As the photometry is
invariant to the rotation and the translation of the object, the rotation and
translation properties of R-signatures are kept for the R3D-signature.
The scale factor is taken into account by dividing each R3D-signature by its
volume during the matching process. Figure 2 shows the error variations due to
the application of basic geometric transforms to a grey level object. The values
obtained are close to zero (except for high stretching values and negative zoom).

2.4 Noise Effect 

Noise has been added to a synthetic image of a sphere. Table 1 provides the
difference, as a percentage, between the original signature and degraded
versions. The low differences attest to the robustness of our method: small
deviations from the global structure of the object have only a weak influence on
the global signature of the object.

Table 1. Differences between a sphere and its degraded versions. max: maximum
error, μ: mean error, RMS: root mean square error.

Fig. 2. Basic geometric transforms. X-axis: transforms; Y-axis: error variations. The
original symbol is defined in the square area.



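2.5 Distance Between R3D-Signatures

Consider two discrete R3D-signatures $\mathcal{R}^A_{3D}$ and $\mathcal{R}^B_{3D}$. Let
$R^A$ and $R^B$ be the normalized signatures, with components $R^A_i$ and $R^B_i$
for grey level $i$, and let $p$ be the number of orientations. The similarity ratio
SR (expressed as a percentage) between $R^A$ and $R^B$ is defined as follows:

$$m^{AB}(\theta, x) = \sum_{i=1}^{\#ng} \min\left(R^A_i(\theta),\, R^B_i((\theta + x)\,\%\,p)\right) . \tag{7}$$

$$M^{AB}(\theta, x) = \sum_{i=1}^{\#ng} \max\left(R^A_i(\theta),\, R^B_i((\theta + x)\,\%\,p)\right) . \tag{8}$$

$$SR = 100 \max_{x \in [0,p[} \frac{\sum_{\theta} m^{AB}(\theta, x)}{\sum_{\theta} M^{AB}(\theta, x)} . \tag{9}$$

Hence SR is not sensitive to symbol scaling. Moreover, note that circular
shifts are applied to $R^B$, and SR is obtained by maximizing the classical
Tanimoto index (min over max). In that way, SR is not sensitive to object
rotation either.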
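
The similarity ratio can be sketched as follows for two signatures stored as lists of
per-level lists over the same p orientations. This is a minimal illustration under
assumptions: normalization divides by the total volume of the signature, as suggested
in section 2.3, and the names `normalize` and `similarity_ratio` are hypothetical.

```python
def normalize(signature):
    """Divide a 3D signature (levels x angles) by its total volume."""
    volume = sum(sum(level) for level in signature)
    return [[value / volume for value in level] for level in signature]

def similarity_ratio(sig_a, sig_b):
    """100 * max over circular shifts of the Tanimoto index (sum of min / sum of max)."""
    a, b = normalize(sig_a), normalize(sig_b)
    p = len(a[0])  # number of orientations
    best = 0.0
    for shift in range(p):
        total_min = total_max = 0.0
        for level_a, level_b in zip(a, b):
            for t in range(p):
                va, vb = level_a[t], level_b[(t + shift) % p]
                total_min += min(va, vb)
                total_max += max(va, vb)
        best = max(best, total_min / total_max)
    return 100.0 * best

# Toy signatures: B is A circularly shifted by one orientation and scaled by 2.
A = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 2.0]]
B = [[8.0, 2.0, 4.0, 6.0], [4.0, 0.0, 2.0, 2.0]]
C = [[4.0, 3.0, 2.0, 1.0], [2.0, 1.0, 1.0, 0.0]]
```

In this toy case `similarity_ratio(A, B)` reaches 100, since B differs from A only by
a circular shift and a global scale factor, while C, which is not a circular shift of
A, scores lower.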
3 Symbol Extraction

We consider a set of symbols to be found in a graphical document. Two cases
are studied: either the symbols are disconnected from the graphic layer network
or they are linked to the network with only trihedral junctions (order 3).

3.1 Disconnected Symbols

In this section, we focus on disconnected symbols obtained after a rough
segmentation of noisy documents. Each grey level symbol is localized using a support
(mask in figure 3) defined from its maximal connected component. We consider
here only components of a consistent size, so as to discard structures that are too
small (assimilated to noise) or too large (greater than the symbols contained in the
database). The selected components are filled to take into account the global
shape of the symbol. Figure 3 presents the processing performed to extract one
symbol in a line-drawing document.
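
The size-based selection of components can be sketched as follows. This is an
illustrative sketch only: it assumes a binary image given as a list of rows,
8-connectivity, and size bounds expressed in pixels; the filling of the selected
components is not shown, and the name `components_in_size_range` is hypothetical.

```python
from collections import deque

def components_in_size_range(binary, min_size, max_size):
    """Return the 8-connected components whose pixel count lies in [min_size, max_size]."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    kept = []
    for y0 in range(h):
        for x0 in range(w):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            component = []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if min_size <= len(component) <= max_size:
                kept.append(component)
    return kept

# Toy document: an isolated noise pixel and a 2x2 blob.
doc = [[1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0]]
candidates = components_in_size_range(doc, min_size=2, max_size=10)
```

Each returned component can then serve as the support from which a mask is built.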





Finally, the support region is applied directly to the grey level document
and the extracted area is matched with each model of the database
using the similarity ratio (see section 2.5). The target symbol is the one with
the maximal similarity ratio.

3.2 Symbols Linked to the Network 

We present in this section a preliminary approach for recognizing basic symbols
linked to a network. We consider in this application symbols having a regular
shape (rectangular or circular, for example). The junction points between the
object and the network are assessed using common pruning and skeleton extraction
steps (junctions of order 3 are processed here). The T34 distance transform [14]
is performed on a rough binarization of the grey level document to compute the
skeleton. Figure 4 shows a symbol to be found in a document and an assessment
of a possible junction scheme for it.

[Figure 4 panels: connected symbol, binary symbol, skeleton.]

Fig. 4. Localisation of a connected symbol.

In this case, two pairs of junctions, corresponding to a connection to two network
lines, have been set. A directed graph is then defined for any configuration of
junctions. For example, if a symbol is linked to one line the graph consists of two
nodes, and of four nodes for two lines. Such a graph is assumed to be the maximal
boundary of the symbol. The retrieval of an object in a document is then guided
by the detection of loops in the skeleton image following the orientations of the
directed graph associated with the object. We then search for all the possible loops
related to such a graph. For each loop, the associated graph is extracted from the
skeleton image by removing the points exterior to the junctions (and by definition
exterior to the current loop). It remains to fill the area delimited by this graph
in order to define the mask to be applied to the document image. Nonetheless,
with this approach the symbol would be cropped in the document because it is
limited by its exterior skeleton. So, in order to grow the mask until it fully covers
the symbol, the inverse distance transform is carried out on the values of the
T34 transform associated with the extracted loop. This process extends
the area delimited by the directed graph to the boundary of the object. Finally,
the same processing steps described in the previous section are performed: filling
in the mask area and matching with the database of symbols.

It is easy to extend the proposed approach to the retrieval of a broad range of
symbols by considering a directed graph definition for each object. The main idea
is to rely on strong graph representations defined from the set of junction
points peculiar to the symbol structure. The directed loop is then set from the
external hull of the skeleton representation of an object. Let x be the number
of nodes of this loop. Finally, the search for symbols linked to a network by k
junctions amounts to the search for loops of cardinality x' = x + k in the skeleton
image.

4 Experimental Results 

In this section two examples of the symbol recognition process are provided. On the
one hand we show that the use of grey level information increases the accuracy
of the recognition step. On the other hand, we show how a structural description
of a symbol improves symbol spotting.

Fig. 5. Example of isolated symbols.

Fig. 6. Similarity ratio (SR: formula 9) between binary symbols.

4.1 Disconnected Symbols 

A database of ten symbols having relatively close shapes is provided. Figure 6
shows an application based on the binary symbols of figure 5. In this case noise
brings artifacts and hence errors into the signature of the object. As a consequence,
very close clusters (both circles and squares in figure 6) give rise to misclassified
symbols; that is, the scores are high for similar shapes. The use of grey level data
overcomes this problem by improving the classification (see figure 7).
In this case, neighbouring clusters are better differentiated.

Fig. 7. Similarity ratio (SR: formula 9) between grey level symbols.



4.2 Connected Symbols 

Figure 8 presents the main steps performed to extract a symbol linked to a
network by two lines in a grey level document: the skeleton computed
from a rough segmentation, the areas extracted from the loops associated with the
directed graph of the object, and the location of the symbol (the scores in black).

Another symbol extraction is given in figure 9. Here a square symbol linked by
three lines is extracted from the network and recognized. In this case the loop
consists of an odd number of nodes.

In terms of complexity, the most costly step is the enumeration of all the loops.
The fastest known algorithm has an upper time bound of O((n+e)(c+1)), where
e is the number of edges, n the number of vertices and c the number of loops.
Typically, we do not search for all the loops: we restrict the search to those
with fewer than about 10 edges. This step is very time-consuming,
but it can be done offline.
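
The restriction to short loops can be sketched as a depth-first enumeration of simple
cycles with a bounded number of edges. This is only an illustrative sketch, not the
optimal enumeration algorithm mentioned above: the skeleton graph is assumed to be
undirected and given as an adjacency dictionary with integer node labels, and the name
`bounded_cycles` is hypothetical.

```python
def bounded_cycles(adjacency, max_edges):
    """Enumerate each simple cycle with at most max_edges edges exactly once."""
    cycles = []

    def extend(start, path, on_path):
        node = path[-1]
        for nxt in adjacency[node]:
            if nxt == start and len(path) >= 3:
                # Keep one of the two traversal directions of the same cycle.
                if path[1] < path[-1]:
                    cycles.append(tuple(path))
            elif nxt > start and nxt not in on_path and len(path) < max_edges:
                # Only visit nodes larger than the start, so that every cycle is
                # reported from its smallest node.
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in sorted(adjacency):
        extend(start, [start], {start})
    return cycles

# Toy skeleton graph: a square 0-1-2-3 with one diagonal 0-2.
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
short_loops = bounded_cycles(graph, max_edges=3)
all_loops = bounded_cycles(graph, max_edges=4)
```

A cycle through m nodes has m edges, so bounding the path length bounds the number
of edges of the reported loops.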

5 Conclusion 

The use of grey level information in noisy documents allows symbols to be better
recognized because it limits the loss of accuracy due to the segmentation process.
We have also outlined a way to handle connected symbols in line drawings
using a structural description.

Although this method can only be applied to symbols with a convex envelope,
these results are very promising. However, they still need further validation by
processing much larger databases of technical drawings, with more graphical
symbols. Currently, we are trying to extend our method to the retrieval of any symbol
by considering a structural representation that takes partially occluded
symbols into account.

Fig. 8. Connected symbol recognition.

Fig. 9. Another connected symbol extraction.
Abstract. Symbol recognition is one of the central problems in the field
of graphics recognition. Many methods and approaches have been developed
in the context of several application domains. In recent years, the
need for generic methods, able to perform well on large sets of symbols in
different domains, has become clear. Thus, standard evaluation datasets
and protocols have to be built in order to evaluate the
performance of all these methods. In this paper we discuss several points
which should be taken into account in the design of such an evaluation
framework, raising a number of open questions for further discussion.
These issues were the starting point for the organization of the contest on
symbol recognition held during the last Workshop on Graphics Recognition
(GREC’03). We also summarize the main features of the dataset
and the evaluation protocol used in the contest, as a first step towards defining
a general evaluation framework that answers these open questions.



1 Introduction 

Performance evaluation has become an important field of interest among computer
vision researchers. As more methods become available to solve a particular
problem, there is a need to evaluate and compare them in order to determine
the strong and weak points of each. Thus, image databases, performance
metrics and performance evaluation protocols have been developed in several
areas of computer vision. In document analysis, some examples can be found in
the performance evaluation of thinning algorithms [1] and OCR systems [2].

In recent years, the graphics recognition community has also become aware
of the importance of evaluation, and several contests concerning different graphics
recognition problems have been organized within the framework of the International
Workshop on Graphics Recognition (GREC). These contests have focused,
up to now, on the raster-to-vector conversion of line drawings, specifically on the
detection of dashed lines [3], the general vectorization problem [4, 5] and the
detection of arcs [6]. As a result of this effort, several metrics and protocols for the
evaluation of line detection algorithms have been developed [7-9].

Another important subject in the development of any graphics recognition
system is symbol recognition. In the conclusions of the past editions of the GREC
Workshop, the need for more generic symbol recognition systems, and hence for
a framework to evaluate them, has been pointed out. The first attempt to
address this issue was carried out in the more general framework of the 15th
International Conference on Pattern Recognition (ICPR’00) [10], where a contest
was organized using a limited set containing 25 symbols from one single domain,
electronics. These symbols were scaled and degraded with a small amount of
binary noise to generate the test data. A more general approach to the
evaluation of generic symbol recognition systems was attempted in the last edition
of the GREC Workshop [11], where 50 symbols from the domains of electronics
and architecture were considered. Models of binary degradation and vectorial
distortion were applied to the data to generate noisy images.

These contests have to be taken as the first steps in the definition of a general 
framework for performance evaluation of symbol recognition methods. In fact, 
the evaluation of symbol recognition is not a straightforward task. As we will 
show in the next sections, symbol recognition is a research field covering a lot of 
aspects about the application domain, the kind and the format of the data, the 
noise and the degradation of images, etc. Moreover, there are a lot of specific 
methods and approaches which are only capable to deal with some of these 
characteristics. Some reviews about the state-of-the-art in symbol recognition 
can be found in [12-14]. A general framework for symbol recognition evaluation 
should take into account all these factors, and should be flexible enough in order 
to be applied to most methods and approaches. 

In general, a framework for performance evaluation must include the data set 
and the ground-truth, the definition of some performance metrics, and the de- 
scription of a protocol to run a method on the data set and compare the results 
with the ground-truth. When trying to define such a framework for the evalua- 
tion of symbol recognition, a lot of open questions arise due to the complexity 
and diversity of the symbol recognition problem. In this paper we review these 
open questions, according to our experience in the organization of the symbol 
recognition contest held during GREC’03. We begin, in section 2, with a review
of the main characteristics of symbol recognition, as this description is essential
to understanding the main issues for performance evaluation in this
domain. Then, in section 3 we focus on the questions concerning the data set 
and the ground-truth. Performance metrics are discussed in section 4 and the 
framework that has to be built up for evaluation in section 5. Finally, in section 6 
we explain the framework defined for the first edition of the symbol recognition 
contest. This framework can be seen as a first step towards answering the questions
discussed in the previous sections. In section 7 we state some conclusions and
outline several issues to work on in order to reach a general framework.

2 Characteristics of Symbol Recognition 

2.1 Domain 

Symbol recognition is a research topic related to many application domains
(architecture, electronics, mechanics, etc.). A symbol, whatever the type of
document, expresses in a compact way a known element that would be time-consuming
and complex to represent each time it appears in a document. Thus,
a symbol can be defined as a graphical entity with a particular meaning in the 
context of a specific application domain. As a result, the understanding of a 
document is possible only if the symbols it contains are known. 

The representation of a symbol may include graphical primitives, such
as segments, circular arcs, solid regions, solid and dashed strokes, etc., but also,
in some domains, arbitrary shapes, textures or color. As noted in [14],
symbols can be composed of different subsets of these graphical primitives, de- 
pending on the application domain, leading to different visual properties and 
configurations. Some examples of this diversity are presented in figure 1. This 
large variety of data has led to the design of a large number of recognition meth- 
ods, based on different approaches, including some ad-hoc methods, specifically 
designed to recognize a family of symbols, sometimes in a particular context. 








Fig. 1. Some examples of various symbols of different application domains. 



2.2 Data Representation 

Whatever the application domain, data can be available in several formats, de- 
pending on the nature of the original source. In symbol recognition, two represen- 
tations are mainly handled: bitmap images and vectorial images. Bitmap images, 
usually produced by original document scanning, may be encoded using several 
color models: binary (black and white), grey-level or color. Vectorial images, 
produced by CAD software or by applying a vectorization step to bitmap
images (see below), are represented using graphical primitives (arcs, segments,
etc.) and attributes (color, thickness, stroke style, etc.).

Data representation often determines a kind of recognition approach. Thus, 
a bitmap document, carrying semantically poor information, is usually analyzed 
by statistical methods. A vectorial image, carrying more symbolic information, 
leads to “high-level” recognition methods, generally based on structural and syntactic
approaches. A vectorization step converts a bitmap image into
a vectorial description, so that “high-level” recognition methods can then be applied to
bitmap images. But this approach raises new issues, as vectorization itself is a
critical process in document analysis. It has been the subject of past contests [4,
5], and even if it has been widely studied [15] and could overall be considered
a “mature” process, there is still room for improvement. In particular, vectorization
still leads to typical artifacts, such as inaccurate location of junctions or small
segments. Moreover, different vectorization methods yield different results, and
some of them can be better suited for a given recognition method. 

2.3 Segmentation

A graphical entity is said to be segmented whenever all of its graphical parts
are plainly identifiable. In the case of symbols, prior segmentation makes
the recognition task easier, as the location of the symbol is known and the
amount of data to be considered is reduced. But in many documents, symbols
are not segmented, and it is not easy to apply a generic segmentation
process. In fact, in these documents, recognition sometimes raises a paradox: 
to be able to recognize symbols, they have to be segmented, and to be able 
to segment, symbols have to be recognized. Thus, recognition methods have to 
identify symbols as well as other graphical parts of the document (the graphical 
environment). 

Segmentation ability is influenced by several factors. On the one hand, symbol 
segmentation is domain sensitive. Whenever the graphical environment can be 
easily modelled, symbols can be easily segmented. This is the case for musical 
scores for example, and more generally for symbols that are not connected to 
other graphical parts of a document. On the other hand, symbol segmentation 
is also approach sensitive. Some recognition methods are well-known to be NP- 
complex when working on non-segmented data. Other approaches, like symbol 
signatures, are able to determine areas of interest, that are likely to contain 
symbols. Thus, they can be taken as a kind of segmentation which facilitates 
the recognition task. Clearly, one of the main issues in symbol recognition is the 
ability for a method, or a chain of methods, to segment. This still remains an 
open issue, as noticed in [14]. 

2.4 Degradation of Data 

Whatever the domain, the kind of acquisition or exploitation format, and the 
kind of symbols, data are often degraded. There are many different types of 
degradation. Sometimes, the original document is degraded itself, as an effect of 
photocopying, hand-drawing, rotation, scaling, resolution, manipulation... But 
degradation also occurs during the analysis of the documents: scanning, bina- 
rization, thinning, vectorization, etc. are well-known to produce artifacts. Some 
recognition methods work well on clean data, but are unable to handle some of 
the degradation types presented above. See [16] for a good presentation of the 
problem and [17] for an implementation of an automatic degradation method. 

3 Data for Symbol Recognition Evaluation 

3.1 Principles 

According to the general description of symbol recognition in section 2, several 
questions arise when trying to define the data to be used in the evaluation. 
Perhaps the first question concerns the application domain of the data. We have
seen that symbol recognition covers different domains, that symbols in each do- 
main can be composed of different primitives, and that each domain can have 
specific constraints for recognition. So, as a first conclusion, a general framework
for performance evaluation should take into account this diversity of data
by including a representative sample of existing symbols. Indeed, no recognition 
method should be favored by the choice of the data, and the diversity is essen- 
tial to assess the generic recognition potential of each method. Clearly, the more 
general the data set is (symbols mixed from all domains), the more we favor 
the development of general recognition methods. However, we may then prevent some
specific methods from taking part in the evaluation, as some methods are only able
to work on some specific domains and kinds of data. One possible solution could
be the definition of several data sets, one for each different domain or group of
primitives. Then, we could assess the performance of a method on each independent
set, but we could also combine performance on all sets to assess the generic
recognition ability. 

A major division among symbol recognition methods is the data format
(bitmap or vectorial) on which they rely. The main problem concerns vectorial data, as
the method employed for vectorization can influence the results of recognition. 
Then, we should wonder if it is better to provide a common vectorization tool to 
be used for all recognition methods or if we should let each method use its own 
vectorization approach. In the first case, if we provide a common vectorization 
tool, we can favor the methods best adapted to that vectorization approach. 
In the latter case, we are evaluating not only recognition, but vectorization plus
recognition. In both cases, results may not be comparable. So another idea is
to base the data set exclusively on vectorial format, and to generate bitmap 
images from the vectorial representation. This approach presents an obvious 
interest for performance evaluation. This kind of conversion is more conservative 
of the intrinsic properties of an image than the reverse operation, as the visual 
representation of a vectorial image and that of its conversion into bitmap format 
are the same. The ground-truth is also easier to build and recognition methods 
can use both formats. 
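
As an illustration of the vectorial-to-bitmap conversion discussed above, the following minimal Python sketch rasterizes a symbol given as a list of straight segments. It is only an example under stated assumptions: the segment tuples (x0, y0, x1, y1), the function names and the choice of Bresenham's line algorithm are illustrative and are not taken from any tool of the framework; arcs and line thickness are ignored.

```python
def draw_segment(bitmap, x0, y0, x1, y1):
    """Set to 1 the pixels of an integer-coordinate segment (Bresenham)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        # Pixels falling outside the image are silently skipped.
        if 0 <= y0 < len(bitmap) and 0 <= x0 < len(bitmap[0]):
            bitmap[y0][x0] = 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def rasterize(segments, width, height):
    """Return a binary image (list of rows) drawn from vectorial segments."""
    bitmap = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in segments:
        draw_segment(bitmap, x0, y0, x1, y1)
    return bitmap


# Hypothetical model: a square described by four segments.
square = [(1, 1, 4, 1), (4, 1, 4, 4), (4, 4, 1, 4), (1, 4, 1, 1)]
image = rasterize(square, 6, 6)
```

Since the bitmap is derived from the vectorial description, the same ground-truth record can describe both versions of the image.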

Segmentation is another important issue concerning the evaluation of symbol
recognition. The question is whether to use real drawings including non-segmented
symbols, or only pre-segmented images, each containing
a single symbol. As with vectorization, in the latter case we only focus on
the evaluation of recognition and not on the evaluation of segmentation plus 
recognition. But, as there is not any generic method for segmentation and as 
real applications of symbol recognition must usually include segmentation too, 
it is also interesting to consider real drawings with non segmented symbols from 
different domains. In this case, the generation of the ground-truth becomes more 
complex and performance measures should take into account issues such as miss- 
ing symbols and false detections, the accuracy of location, and the accuracy of 
recognition itself. 
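
To make these measures concrete, here is a minimal Python sketch of how the results on one non-segmented drawing could be scored. Everything in it is an assumption made for illustration: symbols are located by axis-aligned boxes (x_min, y_min, x_max, y_max), location accuracy is judged by intersection over union with a threshold `min_iou`, and each ground-truth symbol is greedily matched to its best remaining detection. The framework discussed here does not prescribe any of these choices.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def score_drawing(truth, detections, min_iou=0.5):
    """Count correct, mislabeled, missed symbols and false detections.

    truth and detections are lists of (label, box) pairs.
    """
    unmatched = list(detections)
    correct = mislabeled = missed = 0
    for t_label, t_box in truth:
        # Greedy choice: the remaining detection that best overlaps the symbol.
        best = max(unmatched, key=lambda d: iou(t_box, d[1]), default=None)
        if best is None or iou(t_box, best[1]) < min_iou:
            missed += 1
            continue
        unmatched.remove(best)
        if best[0] == t_label:
            correct += 1
        else:
            mislabeled += 1
    return {"correct": correct, "mislabeled": mislabeled,
            "missed": missed, "false_detections": len(unmatched)}
```

A greedy matching is the simplest option; an optimal assignment between detections and ground-truth symbols would be a reasonable refinement.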

Finally, we should include in the data set images with the most usual kinds 
of noise and degradation in real drawings. Obviously those images should be 
highly representative of all possible sources of noise. However, as there are many 
kinds of degradation, acquiring and annotating such a set from real images can
imply a huge manual effort. Some alternatives are discussed in the next section.

3.2 Generation of Data and Ground-Truth

The generation of the data set to be used in any performance evaluation process 
requires two fundamental tasks. Firstly, we need to collect a high number of 
images, representative enough of all domains and including all kinds of primitives 
and noise. Indeed, in order to achieve really significant measures, a lot of images 
are needed, from a lot of different sources and acquired under several acquisition 
conditions and constraints. All possible sources of noise and degradation should 
be taken into account: scanning at different levels of resolution, photocopying, 
printing at low resolution, old and degraded documents, hand-drawn documents, 
etc. Secondly, we need to annotate all these images in order to generate the 
ground-truth to be used in assessing the results of the methods. 

There are two different approaches to get the data set and the ground-truth: 
the use of real data or the generation of synthetic data. Clearly, real data allow
symbol recognition methods to be evaluated on an exact replica of what is to be
found in real applications. Therefore, performance measures obtained using real 
data are good estimators of performance in real drawings. However, the use of 
real data has some disadvantages concerning the acquisition, the organization of 
the data, and the generation of the ground-truth. Indeed, the generation of the 
ground-truth for all these documents implies a lot of manual effort, also needed 
to get segmented images of symbols from real drawings. An alternative way to 
get the data set and its ground-truth, especially for segmented images, is the use
of synthetic data, i.e. the definition of one or several methods to automatically
generate a great deal of images. With this approach, the main problem is the 
definition of suitable models for the automatic generation of images resembling 
the usual kind of noise and variations found in real images: scanning noise, 
degradation of old documents, variability produced by hand-drawing, etc. And 
the ground-truth is easier to set up, as it is implicitly available. 
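
The following Python sketch illustrates why the ground-truth is implicitly available with synthetic data: each generated image is stored together with the parameters that produced it. The model dictionary, the parameter ranges and the function names are hypothetical and chosen only for the example; only rotation and scaling of segment-based models are considered.

```python
import math
import random


def transform(segments, angle_deg, scale):
    """Rotate (around the origin) and scale segments (x0, y0, x1, y1)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)

    def move(x, y):
        return (scale * (c * x - s * y), scale * (s * x + c * y))

    return [move(x0, y0) + move(x1, y1) for x0, y0, x1, y1 in segments]


def generate_samples(models, n, seed=0):
    """Generate n synthetic images; the ground-truth is recorded on the fly."""
    rng = random.Random(seed)  # fixed seed so that a data set is reproducible
    samples = []
    for _ in range(n):
        label = rng.choice(sorted(models))
        angle = rng.uniform(0.0, 360.0)   # assumed range
        scale = rng.uniform(0.5, 2.0)     # assumed range
        samples.append({
            "segments": transform(models[label], angle, scale),
            "truth": {"label": label, "angle": angle, "scale": scale},
        })
    return samples


# Hypothetical models, each one a list of segments.
models = {"cross": [(-1, 0, 1, 0), (0, -1, 0, 1)], "bar": [(-1, 0, 1, 0)]}
data_set = generate_samples(models, 10)
```

No manual annotation is needed here, but the realism of the data depends entirely on the quality of the generation model.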



4 The Evaluation of Symbol Recognition 

Evaluation could be viewed as a global measure that determines the “best”
recognition method. In fact, for previous contests on vectorization or arc recogni- 
tion, some metrics have been defined with more or less success, as they sometimes 
favor some of the aspects which can be taken into account in the evaluation pro- 
cess. A similar approach could be taken for symbol recognition. However, because 
of the large number of variables concerning symbol recognition, it seems diffi- 
cult to define a single performance measure, a suitable metric and a complete 
set of evaluation tests, taking into account all possibilities described in previous 
sections. In fact, symbol recognition remains an active research domain, and it 
seems more interesting to focus on the understanding of the strengths and the
weaknesses of the existing methods rather than to attempt to measure a global
performance. So it is a priori more suitable to consider the evaluation as a
set of measures, each one corresponding to one specific aspect of recognition,
determined from the characteristics of symbol recognition methods and data
presented in sections 2 and 3. As a first step, it makes sense to measure each
aspect alone, as some recognition methods are very specific and thus
not able to deal with all of them. As a second step, some of these stand-alone
criteria could be combined in order to get more global measures.

Then, the problem is which measures provide a good evaluation of symbol
recognition. The information we expect in all cases
is the label of the symbols represented in a test image. In a segmented image,
that may be enough, but for a non-segmented one, possibly including several
instances of different symbols, this is not sufficient. In this case, labels
of symbols have to come with at least information such as location, orientation and
scale. Then, some metrics should also be defined in order to measure the accuracy
of this information. From a general viewpoint, other metrics can be taken into
account to test other evaluation aspects:

— The number or rate of false positives and missing symbols. 

— Considering second/third candidates and confidence rate, if available. 

— Computation time: we can consider only recognition time, or we can also 
include the time used by other processes, such as learning, vectorization or 
segmentation. 

— Scalability, i.e. how the performance degrades as the number of symbols 
increases. We can measure it according to the degradation of recognition 
rates or according to the computation time. 
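
As a small illustration of the second item in this list, the sketch below computes a recognition rate that accepts the true label among the first k candidates returned by a method. The data layout (dictionaries indexed by image name, candidates ranked from most to least likely) is an assumption made for the example.

```python
def recognition_rate(truth, candidates, k=1):
    """Fraction of images whose true label is among the first k candidates.

    truth maps an image name to its label; candidates maps an image name
    to a ranked list of labels. Images without any answer count as errors.
    """
    hits = sum(1 for image, label in truth.items()
               if label in candidates.get(image, [])[:k])
    return hits / len(truth)


# Hypothetical results on three segmented images.
truth = {"img1": "resistor", "img2": "door", "img3": "sink"}
candidates = {"img1": ["resistor", "door"], "img2": ["sink", "door"]}
first_choice = recognition_rate(truth, candidates, k=1)
two_choices = recognition_rate(truth, candidates, k=2)
```

Computing the same rate on data sets with an increasing number of models gives one simple view of scalability.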

5 The Framework for Evaluation 

Whatever the evaluation criteria and data, an evaluation framework must provide
formats and tools for exchanging information about models, images,
tests and results. It must also define a protocol to test a given method on the 
dataset. The first issue is about file formats of images. One assumption to be 
made is that image formats must not degrade the original image and must be 
freely available for all participants. We must work with two main kinds of formats:
bitmap and vectorial. For bitmap images, many solutions already exist
(such as the TIFF, BMP or PNG formats), even if some of these formats
are more popular than others. On the vectorial side, some “standard” ways
of representation also exist, such as DXF or more recently SVG. But for
evaluation purposes, these formats may be too sophisticated. In
fact, a simple format, such as the VEC format proposed by Chhabra [4], seems to be
sufficient. Moreover, its simplicity allows it to be extended later, if required.

The evaluation framework must also define several other file formats in order 
to describe precisely the data included in a test and the results achieved by a 
method. According to the kind of evaluation, several solutions may be suitable. 
Indeed, the goal is to find the best compromise that expresses all useful
information without forcing the participants to interface their recognition
methods with overly sophisticated formats. The easiest option is probably to use
simple text files with a syntax close to that of .ini files, as
it does not require any sophisticated parsing. However, when the information
to describe becomes more complex (for example, the description of results with
several measures for each image), XML seems to be an interesting, flexible format.
Moreover, the use of a DTD or a schema helps to normalize data, avoiding most
description problems. And, combined with XSLT style-sheets, it allows the
extraction and the filtering of data, which can be automatically processed, both
by participants (for interfacing purposes) and organizers (for automatic analysis).
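
The two kinds of files could look as in the following Python sketch, which reads a test description written in an .ini-like syntax and writes results as XML using only the standard library. The section name, the keys, the file names and the XML element and attribute names are all hypothetical: they are not the formats actually defined for any contest and only show how little code each format requires.

```python
import configparser
import xml.etree.ElementTree as ET

# Hypothetical test description in .ini syntax.
TEST_INI = """
[test]
name = test01
models = resistor.vec, door.vec
images = img001.tif, img002.tif
"""


def read_test(text):
    """Parse a test description into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["test"]

    def split(value):
        return [item.strip() for item in value.split(",")]

    return {"name": section["name"],
            "models": split(section["models"]),
            "images": split(section["images"])}


def write_results(test_name, results):
    """Serialize {image: (label, seconds)} as an XML string."""
    root = ET.Element("results", test=test_name)
    for image, (label, seconds) in results.items():
        ET.SubElement(root, "image", name=image, label=label,
                      time=f"{seconds:.3f}")
    return ET.tostring(root, encoding="unicode")


test = read_test(TEST_INI)
xml_text = write_results(test["name"], {"img001.tif": ("resistor", 0.25)})
```

Further measures (orientation, scale, other candidates) could be added as extra attributes or child elements without changing existing readers.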

Another basic idea is to give each participant the possibility to choose which 
tests he wants to compete in, according to the features of his method. To achieve 
this, each test has to be considered as a stand-alone part. Therefore, it has to 
contain all the information that a participant needs to know about a test: which
are the models involved in the test and which are the test images. This principle
is also useful in some other situations. For instance, if a program crashes during
one test, the other tests can still be run.
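
The benefit of stand-alone tests can be sketched in a few lines of Python. The runner below is only an illustration with hypothetical names: `recognize` stands for any participant's method, and a failure is modelled as a Python exception, whereas a real crash of an external program would rather require running each test in a separate process.

```python
def run_tests(tests, recognize):
    """Run each test independently and collect results or failures.

    tests maps a test name to its list of images; recognize maps an
    image to a label and may raise an exception.
    """
    report = {}
    for name, images in tests.items():
        try:
            report[name] = {image: recognize(image) for image in images}
        except Exception as error:
            # The failure stays local: remaining tests are still executed.
            report[name] = f"failed: {error}"
    return report


def fragile_method(image):
    """Hypothetical recognition method that fails on one image."""
    if image == "bad.tif":
        raise ValueError("cannot read " + image)
    return "label-of-" + image


report = run_tests({"t1": ["a.tif"], "t2": ["bad.tif"], "t3": ["b.tif"]},
                   fragile_method)
```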

Finally, we want to point out that the availability of the framework (formats, 
data, etc.) is very important. In the context of the organization of a contest, 
information about formats and data is required for preparing the methods and 
learning purposes. But from a more general point of view, a contest is nothing 
else than a special occasion to evaluate performances. As noticed in [18], there 
is a need for a corpus and performance evaluation methods, in order to evaluate 
symbol recognition capabilities. Since GREC’97, this need still exists and there 
is still a lack of public domain data with associated ground-truth. 

6 The Contest on Symbol Recognition at GREC’03 

The contest on symbol recognition organized in the context of GREC’03 [11] has
to be seen as a first step towards answering some of the questions raised
in the previous sections. It aimed to set up the basis for the definition of a general 
framework for performance evaluation of symbol recognition methods. According 
to the general considerations stated before, the main goal of the contest was not 
to give a single performance measure for each method, but to provide a tool to 
compare different symbol recognition methods under several criteria. 

As we wanted to promote a wide participation of the graphics recognition 
research community and because of the large number of possible options to be 
considered for defining the data set and the protocol of evaluation, a question- 
naire was designed to get feedback from the potential participants. The analysis 
of the answers to this questionnaire helped us to make the final choices.

The first important decision was to use synthetic data for the contest. We 
decided to automatically generate all images because this way, we were able to 
generate a large amount of data and its ground-truth. Concerning the kind of
data, we decided to focus on symbols composed only of linear graphic primitives 
(straight lines and arcs). We did not consider symbols containing solid shapes 
or textures. The reason was to simplify, in this first edition of the contest, the 
description of data, and to be able to provide vectorial data for all symbols. We
only chose two application domains, architecture and electronics, which were 
the most used by the participants, but we mixed symbols from both domains in 
the datasets. We organized the data in three different sets of symbols: the first 
one containing only 5 symbols, the second one with 20 symbols and the third 
one with 50 symbols. This way we could evaluate how the performance of the 
methods evolved with an increasing number of models to consider. In figure 2 we 
can see some of the symbols used in the contest. As we decided to use synthetic 
data, the data set was limited to pre-segmented images of the symbols, and no 
real drawings were considered in this first edition of the contest. We only focused 
on the evaluation of recognition and not segmentation plus recognition. 

Fig. 2. A sample of 20 symbols used in the contest.



In order to favor the participation in the contest, we provided all images 
in both possible formats: bitmap and vectorial. Vectorial images were provided 
in .VEC format [8] as this is the simplest format able to represent vectorial 
data and is well-known by the graphics recognition community. Binary images 
were generated from the vector representation. No vectorization
process is applied to binary images to get the vector representation, so that results
concern only recognition and not vectorization. This way, we do not introduce
any alteration to data as a result of vectorization. Nevertheless, a participant
could take the binary images and apply their own vectorization method.

We defined three categories for the generation of degraded images: the first 
one contained rotated and scaled images of each symbol. The second one aimed to 
model binary degradation of images such as that produced by printing, scanning 
or photocopying. A model based on the method proposed by Kanungo et al. [17] 
was used to generate nine different models of degradation, illustrated in figure 
3. This model was applied to binary images and thus, no vector representation 
is available for this kind of images. Finally, the third category included images 
with shape distortions, similar to those produced by hand-drawing. The model 
for generating such images is based on the Active Shape Models [19], and some 
sample images can be seen in figure 4. This model is applied to vectorial images 
and thus, both vectorial and binary images are available. With these three kinds 
of degradation, we cover a wide range of noise which can be found in real images. 
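
To give an idea of what a binary degradation model of the second category does, the sketch below flips each pixel with a probability that decreases with its distance to the nearest pixel of the opposite value, in the spirit of the model of Kanungo et al. [17]. It is a simplified illustration and not the implementation used in the contest: the parameter names and default values are assumptions, a 4-connected distance replaces a Euclidean one, and any post-processing of the original model (such as morphological smoothing) is omitted.

```python
import math
import random
from collections import deque


def distance_to_opposite(bitmap):
    """4-connected distance from each pixel to the nearest pixel of the
    other value (None everywhere if the image is uniform)."""
    h, w = len(bitmap), len(bitmap[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    # Seeds: pixels touching a pixel of the opposite value (distance 1).
    for y in range(h):
        for x in range(w):
            for dx, dy in neighbours:
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h and bitmap[ny][nx] != bitmap[y][x]:
                    dist[y][x] = 1
                    queue.append((x, y))
                    break
    # Breadth-first propagation towards the inside of each region.
    while queue:
        x, y = queue.popleft()
        for dx, dy in neighbours:
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((nx, ny))
    return dist


def degrade(bitmap, alpha0=1.0, alpha=1.0, beta0=1.0, beta=1.0, eta=0.0, seed=0):
    """Return a noisy copy of a 0/1 image; pixels near edges flip more often."""
    rng = random.Random(seed)
    dist = distance_to_opposite(bitmap)
    noisy = []
    for y, row in enumerate(bitmap):
        out = []
        for x, value in enumerate(row):
            d = dist[y][x]
            if d is None:
                p = eta                                      # uniform image
            elif value:
                p = alpha0 * math.exp(-alpha * d * d) + eta  # foreground
            else:
                p = beta0 * math.exp(-beta * d * d) + eta    # background
            out.append(1 - value if rng.random() < p else value)
        noisy.append(out)
    return noisy
```

Different parameter settings then play the role of different models of degradation, from light edge noise to strong global noise.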

72 different tests were designed to evaluate the performance in all these cases. 
These tests were grouped into five categories according to the type of the data: 
tests containing only ideal images of the symbols, tests with rotated and scaled
images, tests with binary degradations, tests with shape distortions and tests 
combining binary degradations and shape distortions. For each category, several 

Fig. 3. [...] some models of degradation used.




(a) (b) (c) 

Fig. 4. Examples of increasing levels of vectorial distortion. 



tests were designed, each one with a different number of symbols or different 
levels of degradation and distortion. With this organization of the tests, we could 
evaluate the robustness of methods to the most common types of degradations 
and to the increase in the number of symbols to be recognized. Before the contest, 
some sample tests for each category were supplied to the participants so that 
they could train their methods with them. Each test was independent of all the 
others so that each participant could choose the tests he wanted to take part in. 
The description of the tests was formatted as a .ini file because only information
about the models and the images involved in the test was required. However, 
each participant had to provide the final results for each test in an XML file, as 
several parameters (label of the symbol, computation time, rotation, scale, etc.) 
could be provided for each image in the result file. Although the XML format of 
the file can include more information about the result of the recognition, in this 
first edition of the contest we only considered the label of the recognized symbol 
and the computation time. 

Five methods took part in the contest. A more detailed analysis of the contest 
and the results achieved by each method can be found in [11]. 



7 Conclusion 

The performance evaluation of symbol recognition is not a trivial task. Many 
characteristics and particularities have to be taken into account: the application 
domain, the graphical entities composing the symbol, the format of the data, 
segmentation, vectorization, noise and degradation, the definition of performance 
measures, etc. In this context, many precautions have to be taken. Indeed, the 
purpose must be to determine the strengths and weaknesses of symbol recognition
methods. Thus, a generic and independent framework has to be proposed in
order to measure, as well as possible, all these criteria. A lot of open questions
arise when trying to define it, concerning the kind of data and how to generate
it, the performance metrics, and the protocol of evaluation. We have summarized
some of them in this paper, namely those which led us to the organization of a contest
on symbol recognition at GREC’03. We have also described the options taken
to set up the framework for the contest, among that set of possibilities. 

This framework should be considered only as a starting point in the definition 
of a general framework for the evaluation of symbol recognition. We are aware 
that in this first edition of the contest we had to choose simple options and we 
left out many interesting issues. Thus, there are several points which should be 
considered in order to make this framework more general and complete: 

— Collecting more images from different application domains, in order to get a 
really large dataset, with symbols including all kinds of graphic primitives, 
and thus to be able to measure the scalability of the methods. We should 
organize this dataset according to the application domain, the shape of the 
symbols or some other criteria. 

— The relation of recognition with other processes, such as vectorization or 
segmentation, should be clearly determined in order to establish how these 
processes influence the performance of recognition. Particularly, we should 
provide non-segmented images and define how to evaluate them. 

— More models of image degradation and distortion can be studied and used 
to generate noisy images. 

— Performance measures, other than the label of the symbol, should be consid- 
ered, such as orientation, scale, ability to segment, computation time, etc. 
How to combine all these measures also remains an open question.

— Finally, the evaluation framework should also provide tools for the automatic 
evaluation of any recognition method, and as much as possible be perma- 
nently available in order to constitute a tool of reference. We are planning 
to provide this service thanks to a forthcoming web site. 


Abstract. This paper presents the main design and development issues
of the Qgar software environment for graphics recognition applications. 
We aim at providing stable and robust implementations of state-of-the- 
art methods and algorithms, within an intuitive and user-friendly envi- 
ronment. The resulting software system is open, so that our applications 
can be easily interfaced with other systems, and, conversely, that third- 
party applications can be “plugged” into our environment with little 
effort. The paper also presents a quick tour of the various components 
of the Qgar environment, and concentrates on the usefulness of this kind 
of system for testing and evaluation purposes. 



1 Introduction 

The design of document analysis systems implies mastering a number of complex 
issues. Firstly, and obviously, the designers have to put together a set of stable, 
robust implementations of state-of-the-art methods and algorithms, to perform 
the various processing and recognition tasks needed in document image analysis: 
Image preprocessing and filtering, binarization, text-graphics separation, vector- 
ization, feature extraction, character and symbol recognition, etc. Best practices 
in software engineering, as well as the need for performance assessment and eval- 
uation tools, lead to the requirement that each function, class, or task should be 
associated with data sets for testing and evaluation purposes. 

Secondly, the system must be easy to use (and hence provide an intuitive 
user interface), easy to install, and well documented. The various tools and 
applications should be as generic as possible, so that the system does not solely 
work for a tiny subset of document analysis tasks. 

The third requirement is probably the most important: The system has to 
be open, versatile, and interfaceable. Even if the best care is given to all possi- 
ble aspects of a document analysis problem, and even if the development team 
has unlimited resources (which is obviously never the case), some applications 
will inevitably not be foreseen in the system. Had they been foreseen, other 
teams may still want to use their own algorithm or method for a given problem, 
and may want to interface their tools with the general system, without being 
constrained by a rigid framework, non-standard formats, etc. Conversely, the 
document analysis system, even if it provides by itself a large coverage of document
analysis needs, may have to become a component of a larger system. Once
again, interfacing and openness are vital requirements. 

These design and integration problems are well-known to many researchers 
in document analysis, and more generally in image processing. One of the most 
famous initiatives to deal with them was the Image Understanding Environment 
(IUE) [1,2], which aimed at providing a complete environment for all kinds of
image processing applications. Many useful ideas came out of this project, but 
the resulting environment is extremely complex and hard to fully master. 

Of course, a number of specific, ad hoc document analysis systems have been 
implemented, but the design problems are undoubtedly of another order of mag- 
nitude when the system has to be versatile, as the flexible integration of all 
the system components becomes a crucial problem on its own. Still, there have
been clear successes in some subareas. For instance, there is good know-how 
in building flexible and versatile OCR and page reader systems, taking into 
account segmentation, feature extraction, classification, and various linguistic 
post-processing steps [3]. Extending the concept to various business documents, 
Dengel et al. have also developed a very mature system, capable of adaptation 
to different types of documents [4,5]. Other specific application domains have 
their mature systems, such as bank check processing software [6], table recog- 
nition [7], or forms processing [8]. Some effort has also been made in building 
flexible environments for handwriting recognition [9].

However, when it comes to graphics-rich documents, most systems remain 
very context-dependent and are just able to solve a very specific problem. Few 
teams made attempts to build generic tools. One of the few early initiatives 
in this direction comes from Pasternak [10], who proposes a hierarchical and 
structural description coupled with triggering mechanisms for interpretation. 
Another example is the DMOS method proposed by a team at IRISA [11], and 
based on syntactical tools to describe the domain knowledge. 

Our team has been involved for several years in the development of such an open
software environment (called Qgar) according to a software engineering viewpoint
[12]. In some aspects, it is similar to the Gamera framework [13], which
provides domain experts with a high abstraction-level, user-friendly environment 
to build, to tune, and to compare document analysis applications. Its plug-in ar- 
chitecture allows a quick integration of new state-of-the-art methods. Qgar also 
offers a common environment where applications and third-party contributions 
can easily coexist and interact. However, it mainly aims at providing the community
with stable, robust, and generic document analysis methods, either as a set
of off-the-shelf components or as an application builder toolkit [14-16]. Lately, 
we have successfully integrated Qgar with other tools and systems, as part of a 
French national project on software for document analysis [17]. 

This paper gives an overview of the main design issues for such a system. 
In section 2, we describe the general concepts which have guided us throughout 
the development. We then propose a brief tour of the environment (§3), before 
concentrating on the important issue of testing (§4), from the perspectives of
software development, performance evaluation, and benchmarking.

2 Underlying Concepts

When developing the Qgar software platform, we had three main objectives in 
mind: 

- ease the development of new image processing applications by grouping
together implementations of graphics recognition methods and common data
structures,

- offer an environment to tune and evaluate the performance of our image
processing applications and compare them with other existing ones,

- demonstrate and distribute our know-how in the field of graphics recognition.

The Qgar platform is an ever evolving project, getting enriched by all develop- 
ments required in the course of our research. In some way, it could be considered 
as a snapshot of our current know-how about graphics recognition and docu- 
ment image processing. This explains why we did not orient our development 
effort towards a particular functionality. Instead, we paid particular attention
to several (software engineering) issues, such as:

- Interoperability: Our platform has to interact easily at various levels with
external systems. A high level of interoperability is a great asset to build
document analysis chains integrating third-party applications. It is a great
help for evaluation purposes, as it allows us to compare our work with others
within a common framework. Interoperability is also a key feature for distribution,
as it makes all or part of the platform available to other (research)
groups with little effort. For example, the integration of Qgar applications
into the DocMining platform¹, for heterogeneous document interpretation,
raised no major difficulty [17].

- Reusability: As stated above, the software reuse rate must be maximized
from one research work to another. This implies that the platform architecture
and implementation must be highly modular since, in particular, only
specialized features from a given application are useful to build the next one.
The finer the granularity of the corresponding software components, the
more they may be reused. The architecture should also be flexible enough
to allow a quick integration of existing software.

- Extensibility: Since the platform is intended to evolve according to our needs,
it must be extensible and easy to maintain. Adding new features must not
interfere with already implemented functionalities, and updating or extending
existing code to fulfill more precise or slightly different purposes must be
simple. A failure to achieve such requirements would lead to the construction
of parallel versions of the platform, specifically designed for different
aims. This would contradict the very purpose of our project.

¹ This project is supported by a consortium including four academic partners, PSI
Lab (Rouen, France), Project Qgar (LORIA, Nancy, France), LSI Lab (La Rochelle,
France), DIUF Lab (Fribourg, Switzerland), and one industrial partner, GRI Lab
from France Telecom R&D (Lannion, France). It is partially funded by the French
Ministry of Research, under the auspices of RNTL (Réseau National des Technologies
Logicielles).

We had to conform to several principles to meet these requirements, and
first of all enforce software quality procedures. As capitalizing on previous de- 
velopments is one of our objectives, we try to produce high quality software. 
Software poorly documented or written without thorough design is not likely 
to be adapted to purposes different from the initial ones. If rigorous design and 
implementation are essential to produce high standard code, they must be com- 
plemented by good documentation and extensive test suites to result in long 
lasting code. 

Documentation is essential when it comes to code reuse or interoperability.
Maintaining a component, modifying it, or making it interact with other software
artifacts requires information on its design and implementation. To ensure
the best documentation coverage, the full source code of Qgar is therefore
self-documented. The documentation is embedded in code comments and can be
automatically extracted using Doxygen² to produce online browsable HTML pages
or high-quality hardcopy documents. This documentation is complemented by
several manuals, including design papers and tutorials.

Testing is another aspect of software quality that is essential in the Qgar 
context. Regression and performance tests are required to detect errors when 
integrating, upgrading or refactoring pieces of code. Considering the modularity 
of the platform architecture, unit testing appeared to be the best methodology 
to design test suites. Each Qgar module is thus associated with its own inde- 
pendent set of regression and performance tests. They provide great help to 
detect the propagation of errors when integrating a new module, to evaluate the 
performance gain when refactoring, and much more... 

C++ has been chosen as the programming language, so as to combine the
object-oriented programming paradigm, which is appropriate to modular and
easy-to-maintain software production, with several other advantages of the language, all
conforming to our requirements. When efficiency becomes more important than
modularity, it allows the use of well-known C programming tricks, bypassing
the computational overhead implied by an object-oriented design. C++ is also,
more or less, an industrial standard, giving access to a lot of existing software.

The Qgar software environment is distributed³ under the terms of the GNU
Lesser General Public License (LGPL) and, for the user interface only, the Q
Public License (QPL). In this way, our work can be freely used, modified and
redistributed by others, and we can also reuse existing code published under any
LGPL-compatible license. In fact, the choice of such a license helps to improve
the software quality in the long run, thanks to the feedback provided by the users:
comments, bug reports and corrections, or even requests for new features.

² http://www.doxygen.org/

³ The Qgar system may be downloaded from its web site at http://www.qgar.org/.

3 System Overview

The Qgar software system is divided into three parts: QgarLib, a library of C++
classes implementing basic image and graphics recognition operators, QgarApps,
an applicative layer constructed from QgarLib components and providing graphics
recognition applications, such as text-graphics separation or vectorization,
and QgarGui, a user interface dedicated to result visualization and application
supervision. The whole system represents about 60,000 lines of C++ code. Particular
attention has been paid to the support of standard formats (PBM+, DXF,
SVG), high-quality documentation, configuration facilities (using autoconf/automake),
and support of Unix/Linux operating systems.



3.1 The QgarLib Library 

The object-oriented paradigm is based on data encapsulation, i.e. describing 
abstract data types and their interfaces with the clients. However, image (or 
graphics) processing generally concerns collections of operators, i.e. procedures, 
to be applied to images. There are two solutions to implement such proce- 
dures in an object-oriented way. The first one consists in implementing them 
as function members of the classes of the objects on which they operate. For 
example, the convolution of an image by a Gaussian could be defined as the 
gaussianConvolution function member of the Image class (that describes images).
Such a technique implies ongoing modification of the interfaces of existing
classes, whenever new methods are added to the library. Moreover, the inter- 
faces of the corresponding classes grow in proportion and become too large to 
be efficiently used. 

We thus preferred a more pragmatic solution, based on a very simple idea. 
Image becomes the base class of a hierarchy, and an operation that processes an 
image is implemented by a constructor of a new derived class: The convolution by 
a Gaussian is thus implemented as the constructor of the GaussianConvolution 
class deriving from the Image class. In this way, there is no need to modify the 
existing hierarchy and classes, and different methods to perform the same con- 
ceptual operation can be easily implemented through derived classes of an ab- 
stract class. This principle has two main advantages. On the one hand, software 
parts separately designed may be easily integrated in the Qgar software, even 
when written in C. On the other hand, designers as well as clients of the library
can write compact and easy-to-read code, which meets the understandability 
requirements [18]. 
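
The principle can be illustrated with a minimal sketch. The classes below are hypothetical, simplified stand-ins written for this illustration (they are not the actual QgarLib interfaces): a small Image class, and a GaussianConvolution class deriving from it whose constructor performs the operation. Pixel storage, border clamping and the 3-sigma kernel radius are all assumptions of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified image class; it only mimics the idea, not QgarLib's API.
class Image {
public:
    Image(std::size_t width, std::size_t height, float fill = 0.0f)
        : width_(width), height_(height), pixels_(width * height, fill) {}
    virtual ~Image() = default;

    std::size_t width() const { return width_; }
    std::size_t height() const { return height_; }
    float at(std::size_t x, std::size_t y) const { return pixels_[y * width_ + x]; }
    float& at(std::size_t x, std::size_t y) { return pixels_[y * width_ + x]; }

private:
    std::size_t width_;
    std::size_t height_;
    std::vector<float> pixels_;
};

// The operation is the constructor: creating the object computes the result.
class GaussianConvolution : public Image {
public:
    GaussianConvolution(const Image& src, double sigma)
        : Image(src.width(), src.height()) {
        assert(sigma > 0.0 && src.width() > 0 && src.height() > 0);
        const int radius = static_cast<int>(std::ceil(3.0 * sigma));

        // Normalized 1-D Gaussian kernel (the 2-D Gaussian is separable).
        std::vector<double> kernel(2 * radius + 1);
        double sum = 0.0;
        for (int i = -radius; i <= radius; ++i) {
            kernel[i + radius] = std::exp(-(i * i) / (2.0 * sigma * sigma));
            sum += kernel[i + radius];
        }
        for (double& k : kernel) k /= sum;

        const int w = static_cast<int>(src.width());
        const int h = static_cast<int>(src.height());

        // Horizontal pass into a temporary image; coordinates are clamped at borders.
        Image tmp(src.width(), src.height());
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                double acc = 0.0;
                for (int i = -radius; i <= radius; ++i)
                    acc += kernel[i + radius] * src.at(std::min(std::max(x + i, 0), w - 1), y);
                tmp.at(x, y) = static_cast<float>(acc);
            }

        // Vertical pass from the temporary image into this object.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                double acc = 0.0;
                for (int i = -radius; i <= radius; ++i)
                    acc += kernel[i + radius] * tmp.at(x, std::min(std::max(y + i, 0), h - 1));
                at(x, y) = static_cast<float>(acc);
            }
    }
};
```

With such classes, a client simply writes `GaussianConvolution blurred(img, 1.0);` and then uses `blurred` wherever an Image is expected. A different smoothing method would be added as another derived class, leaving Image untouched.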

To illustrate the philosophy of our approach, Figure 1 shows excerpts from
a text-graphics segmentation application, implementing our own variation of
Fletcher and Kasturi’s algorithm [19,20]. In the figure, we omitted the computations
themselves - obviously, such an algorithm cannot boil down to merely
calling a succession of constructors - but the example should give the general
idea of the ease of writing a graphics recognition algorithm using the classes provided
by QgarLib. As one can see, the code remains quite understandable. This
is an idyllic view as, in the general case, additional parameters are needed to 
accurately perform the different image processing steps. However, we firmly be- 
lieve that the general philosophy holds for most low-level and intermediate-level 
operations in graphics recognition applications.

// File containing the initial (binary) image
PbmFile pbf("an_image.pbm");

// Load the image
BinaryImage initImg(pbf);

// Prune small connected components from the image
PrunedConCompBinaryImage prunedImg(initImg, 5);

// Extract connected components from the resulting image
ConnectedComponents components(prunedImg);

// Do some filtering to get the textual components:

// The result is given by variable txtComponents

// Create the binary image of the textual components
ComponentBinaryImage txtImg(components, txtComponents);

// Subtract the textual components from the initial image
// to get the graphical components
initImg -= txtImg;

// Save the image of textual components
PbmFile txtf("txt_image.pbm");
txtImg.save(txtf);

// Save the image of graphical components
PbmFile graphxf("graphx_image.pbm");
initImg.save(graphxf);



Fig. 1. Excerpts from a text-graphics segmentation application. 



QgarLib classes implement most data structures and algorithms to be han- 
dled when designing a document analysis system (graphical objects such as 
points, segments, arcs of circle... image processing operations such as convo- 
lutions, mathematical morphology... and so on) as well as different kinds of 
utilities, especially to store graphics or image data in files using various formats. 
The introduction of XML in the library has become inevitable, as it is now a 
standard format which is accepted by an increasing number of tools. Moreover, 
import and export of SVG data cannot work without it. However, to avoid com- 
patibility problems between various software licenses, an external XML library 
cannot be directly integrated. We have therefore adopted a solution allowing the 
use of any SAX XML parser. It is based on a set of interfaces providing an indi- 
rection level between the Qgar system and the external parser. These interfaces 
have to be implemented by concrete classes acting as wrappers between an XML 
API and the system. Wrappers for the Xerces and Qt parsers have already been 
implemented.

3.2 The QgarApps Applications

QgarApps is a set of stand-alone applications representing the applicative layer 
of the Qgar system. Some of them are simple wrappings of QgarLib classes 
implementing the construction of useful objects, essentially intended to serve 
generic purposes, whereas others are more sophisticated and specific programs. 
About ten applications are available, such as binarization, thin-thick segmenta- 
tion, text-graphic segmentation, vectorization, etc. 

An application is run from a command line and works as a black box, to which 
clients have to supply arguments (input/output images, threshold values, etc.)
suited to the result they want to obtain. Communication between applications 
is file-based, in order to simplify the implementation. The interoperability and 
integration capability of the Qgar system relies on these applications, which 
can be sequentially invoked to build processing chains performing document 
analysis tasks. Chains performing the same task according to different methods 
can be quickly elaborated and can be conveniently compared, for performance 
evaluation for instance. 

However, an application is practically useless if clients are not supplied with 
a description of the application itself (the tasks which are performed, the corre- 
sponding methods, etc.), of its parameters, and of the syntax of the command 
line. This information is given by an XML file associated with the application.
In this way, an “external” application, like the DocMining platform [17], can 
integrate any Qgar application, provided that it is able to parse the correspond- 
ing XML description to “understand” the behavior of the Qgar application and 
to get the kind of data the Qgar application must be fed with. Conversely, a 
Qgar application can be interfaced with any application designed according to 
the same principle like, for example, the graphical user interface QgarGui (see 
next section), or the Qgar web site, that proposes dynamic demos of Qgar applications.
In the latter case, an HTML form is automatically generated from
the description file to get parameter values supplied by clients to run selected 
applications (Fig. 2). 
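
As a rough illustration of how a client could exploit such a description, the sketch below assembles a command line from a list of parameter descriptions and the values supplied by a user. It is only a sketch under stated assumptions: the ParamDescription structure, the buildCommandLine function, the `-flag value` syntax and the executable name used in the usage note are hypothetical, and the XML parsing step itself is left out. The fields mirror the name, flag and default attributes of the description shown in Fig. 3(a).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One <param> entry of an application description (hypothetical structure).
struct ParamDescription {
    std::string name;          // human-readable name, e.g. "Threshold"
    std::string flag;          // command-line flag, e.g. "thr"
    std::string defaultValue;  // empty when the description gives no default
};

// Builds "<executable> -<flag> <value> ..." (assumed syntax, for illustration only).
// A value supplied by the user overrides the default given in the description.
std::string buildCommandLine(const std::string& executable,
                             const std::vector<ParamDescription>& params,
                             const std::map<std::string, std::string>& userValues) {
    std::string line = executable;
    for (const ParamDescription& p : params) {
        const auto it = userValues.find(p.name);
        const std::string& value =
            (it != userValues.end()) ? it->second : p.defaultValue;
        if (value.empty()) continue;  // nothing supplied and no default: skip
        line += " -" + p.flag + " " + value;
    }
    return line;
}
```

For instance, with the three parameters of Fig. 3(a) and user values for the source and target images only, a call such as `buildCommandLine("binarize", params, values)` would fall back on the default threshold of 127. A form or dialog generator can walk the same list to create one input field per parameter.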



3.3 The QgarGui User Interface 

QgarGui provides users with a friendly environment to visualize and interact 
with every step of a document processing chain. It is completely independent 
of the other parts of the platform, though it may be used to control any Qgar 
application. As previously mentioned, an application is integrated within the 
interface with the only help of its description file. This description is used to 
automatically generate dialogs to tune the application parameters (for a spe- 
cific issue) and to make online help directly available thanks to the embedded 
application documentation. For example, Figure 3 shows the activation dialog
created from the description of an application which performs a binarization. 
Even when designed apart from the Qgar platform, an application can thus be 
very easily integrated into the platform, without recompiling any part of the 
platform itself, as long as a description is provided. In this way, the interface
may also be appropriately used as a flexible tool to compare the efficiency of
different applications/methods delivering the same kind of result.

Fig. 2. HTML form automatically generated by the Qgar web site for the application
performing text-graphics segmentation.

<application>
  <descr>
    <name>Fixed Binarization</name>
  </descr>
  <paramlist>
    <param name="Source Image"
           flag="in"
           type="grayscale" />
    <param name="Target Image"
           flag="out"
           type="binary"
           default="_.pbm" />
    <param name="Threshold"
           flag="thr"
           type="numeric"
           default="127"
           min="0"
           max="255" />
  </paramlist>
</application>

(a)

[Dialog box "Fixed Binarization" with tabs Call Parameters, Command Line and Description; input parameters Source Image (PGM) = issy_15.pgm and Threshold (int) = 113; output parameter Target Image (PBM) = issy_15.pbm; buttons Help, Launch and Cancel.]

(b)

Fig. 3. Automatic integration of an application of binarization: From the XML description
of the application (a), QgarGui generates a dialog box (b) to tune its parameters
and then to run it.

QgarGui supports bitmap and vectorial image formats, such as subsets of
DXF or SVG. It offers features to apply and control image processing tools, to
display results (including zooming and image superimposing facilities), and to
manually correct them (i.e. add missing results, delete or alter erroneous ones...)
if needed. In particular, special editing operations for both bitmap and vectorial
images are supported, such as copy and paste of bitmap images, creation and
modification of vectorization components, etc. The interface is implemented in
C++ using the Qt framework⁴ and is distributed under the QPL license. This
makes it easy for anyone interested in its features to upgrade or customize it.



4 Testing 

As mentioned in § 2, testing is an important part of the Qgar platform develop- 
ment and it takes place on two levels, with different purposes. 

The first level is related to the developer viewpoint. It is based on a white
box approach with fine granularity, each test focusing on several code lines, up
to a full C++ class. These tests are implemented according to the unit testing
approach, as defined by the extreme programming methodology [21], and are
directly integrated in our software development process. For each C++ class, a
“twin” test-class implements as many test-function members as features of the
original class. A test-function is a set of assertions which automatically check whether
the code of the corresponding feature runs correctly. Once written, these tests
are handled by a dedicated framework and can be run individually or together,
using a separate API, CppUnit⁵. Every time new code is introduced in the
platform, or old code is updated, all tests are run.
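
The “twin” test-class idea can be sketched as follows. To keep the example self-contained, plain boolean checks stand in for the assertion macros of a test framework, and both classes (Segment and SegmentTest) are hypothetical examples written for this illustration, not actual Qgar classes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical library class under test (not an actual QgarLib class).
class Segment {
public:
    Segment(double x1, double y1, double x2, double y2)
        : x1_(x1), y1_(y1), x2_(x2), y2_(y2) {}

    double length() const { return std::hypot(x2_ - x1_, y2_ - y1_); }

    void translate(double dx, double dy) {
        x1_ += dx; y1_ += dy;
        x2_ += dx; y2_ += dy;
    }

    double x1() const { return x1_; }
    double y1() const { return y1_; }

private:
    double x1_, y1_, x2_, y2_;
};

// "Twin" test-class: one test-function member per feature of Segment.
class SegmentTest {
public:
    bool testLength() const {
        return std::fabs(Segment(0, 0, 3, 4).length() - 5.0) < 1e-12;
    }

    bool testTranslate() const {
        Segment s(0, 0, 3, 4);
        s.translate(2, -1);
        // Translation moves the end points but must not change the length.
        return s.x1() == 2.0 && s.y1() == -1.0 &&
               std::fabs(s.length() - 5.0) < 1e-12;
    }

    // Runs every test-function and returns the number of failures.
    int runAll() const {
        int failures = 0;
        if (!testLength()) ++failures;
        if (!testTranslate()) ++failures;
        return failures;
    }
};
```

A regression run then amounts to checking that `SegmentTest().runAll()` returns 0 after every change. In a real test framework, each test-function would be registered in a suite rather than called by hand.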

Such a systematic test policy offers several immediate advantages. It provides 
developers with instant feedback on the design of any new code. If the tests
are hard to write, it generally means that the proposed design is either improper 
or too complex. It also improves the modularity of the platform: Each new class 
is implemented along with a direct client class, the test-class, that constrains 
developers to focus on the services offered by this class, rather than on their 
implementation. 

There are also long term benefits. The set of all the test-classes provides a 
solid regression test suite. This eases the process of refactoring [22], as tests can
be run every time a code change occurs, ensuring that no other feature has been
broken. The code quality is improved in the long run: Every time a glitch is
found, a new test is introduced to ensure that it will not appear again. 

The second testing level, called functional testing, is related to the client 
viewpoint and focuses on the ready-to-use services offered by the software, that 
is to say Qgar applications, in our case. Functional tests are written once an 
application is ready to be run. They are designed to validate that the application
fulfills the tasks it is supposed to perform and to evaluate its performance. The
tests are completely independent of the application implementation. They see
the application as a black box and proceed only by comparing its inputs and
outputs. This approach is particularly interesting when combined with the method of
integration of third-party applications in the platform (see §3.2). Indeed, the
tests defined for an application can be reused for any application designed for
the same purpose. This makes Qgar a good tool to compare different methods
or different implementations of a method.

⁴ The Qt framework is available at http://www.trolltech.com/.

⁵ http://cppunit.sourceforge.net/

Unfortunately, unlike unit tests, functional tests are complex to set up, and 
a lot of work remains to be done to define a general, complete framework to 
perform such tests. However, their importance is crucial to the field, as they are 
at the basis of numerous performance evaluation schemes. We have therefore 
started work on adding a number of evaluation methods and tools to the Qgar 
platform, providing, among other things, support for the VEC format used by 
various evaluation methods [23] and for the document image degradation model 
proposed by Kanungo [24]. These tools were used for our participation in the 
organization of the first symbol recognition contest, held in Barcelona during 
GREC’03 [25]. We also plan to integrate other evaluation methods, notably
the vector-to-ground-truth matching method developed by
Xavier Hilaire within our group [26].

5 Conclusion 

In our work to develop the Qgar software environment, we do our best to aim 
at genericity, ease of use, and interoperability with other software systems. This 
led us to the adoption of rigorous design principles, which have been detailed in 
this paper. The result is a reference platform, available to any person interested 
in building a document analysis system. The environment can be used as such, 
obviously with its limitations and constraints. It can also be easily enriched by 
“plugging in” document analysis tools separately developed, on the unique con- 
dition that a description is provided for each new tool to be added, as explained 
in section 3.2. Conversely, the whole environment can itself become a compo- 
nent of a larger project, using the same principles, as we have proved with the 
DocMining project. 

Beyond the interest for the community of being able to freely download
state-of-the-art document analysis tools and methods, to use them, and
possibly to integrate them into one’s own application, such a software environment
provides an efficient platform for running benchmarks, comparisons and
performance evaluations. While software-oriented test facilities, as described in
section 4, have been introduced in the platform, we have started adding support
for testing and benchmarking at the functional level. This effort will be
intensified in the near future, as the Qgar environment provides a good basis
for conducting thorough evaluation campaigns on various document analysis
methods.

Abstract. This paper presents a method for retrieving engineering drawings by
their shape appearance. In this method, an engineering drawing is represented
by an attributed graph, where each node corresponds to a meaningful primitive
extracted from the original drawing image. This representation, which characterizes
the primitives as well as their spatial relationships by graph node attributes
and edge attributes respectively, provides a global view of the drawings.
Thus, the retrieval problem can be formulated as one of attributed graph matching,
which is solved using mean field theory in this paper. The effectiveness of
this method is verified by experiments.

Keywords: Engineering drawings retrieval, attributed graph, mean field theory



1 Introduction 

On-line maintenance of large volumes of documents, such as engineering drawings,
has recently become a major research area due to the ever increasing rate at which
these documents are generated in many application fields [1, 2]. It is very common
for designers and draftspeople to refer to previous drawings to get
some inspiration or to reuse solutions already achieved. However, retrieving these documents
is generally slow and tedious work, which requires an exhaustive examination of the
whole engineering drawing database. To facilitate such retrieval, textual content such
as keywords has been widely used. While this information is helpful in retrieval,
generating such descriptions manually is heavy work; moreover, a few
keywords alone are generally unable to describe the true content of a drawing.

Technical drawings retrieval, or any other type of matching problem, can be seen 
as a correspondence calculation process. Whether a database drawing is retrieved or 
not is thus determined by the correspondence value between the query drawing and 
this database drawing. For this purpose, the technical drawings are first represented 
by attributed graphs, where graph nodes correspond to meaningful primitives ex- 
tracted from the original drawing image, such as lines and curves, while the spatial 
relationships between these primitives are described by graph edges. Next, the graph
nodes of the query drawing are regarded as a set of labels used to label the nodes of a
database drawing, so as to determine the correspondence between them. In this way,
both the database drawings which are similar to the query and those which
include a part similar to the query can be obtained.

To generate attributed graphs from engineering drawing images, the original raster
images must first be converted to vector form, i.e. vectorized. As a preliminary
step of document analysis and processing, many vectorization techniques have been
developed in various domains [3-5]. In this paper, it is assumed that auxiliary
information such as annotation text has been removed, and that all the curves in the
drawing images have been reduced to one-pixel width by some thinning algorithm.
First, the drawing image is decomposed into rough primitives based on the intersection
points. Then, a merge-split process is used to make them more meaningful.

The primitives obtained by the above process are treated as nodes to build an attributed
graph. Unlike Beniot’s N-nearest-neighbor method [6], the Delaunay tessellation
strategy is used to generate the structural description of engineering drawings.
In this way, the spatial relationships between these primitives are represented
more naturally and meaningfully.

By the above process, the content and the structure of engineering drawings are represented
by means of attributed graphs. This enables us to interpret the retrieval problem
as one of matching two attributed graphs. In recent years, several methods have
been proposed to solve the graph matching problem [7-11]. As a compromise
between speed and performance, mean field theory is adopted in this paper to solve
the graph matching problem.

The remainder of this paper is organized as follows: In section 2, we describe the
engineering drawing decomposition process which divides the drawing image into meaningful
primitives. The construction of graphs from these primitives, as well as the definition
of attributes, is introduced in section 3. In section 4, the graph matching algorithm is
outlined. Finally, the experiments as well as some discussion are given in section 5.



2 Primitives Extraction 

Engineering drawings consist mainly of basic primitives, which are assembled
by specific spatial distributions and structural relationships into cognitively
meaningful objects. This suggests that a structural description can be employed to
represent the content of an engineering drawing well. In this paper, a drawing image is
first decomposed into rough primitives, which are then refined by a merge-split
process into a meaningful form.



2.1 Engineering Drawing Image Decomposition 

As mentioned in the introduction, it is assumed that some pre-processing has been
applied to the original drawing images to remove auxiliary information such as annotation
text, and to make them binary and composed of one-pixel-width curves only. For this
kind of image, rough primitives are extracted by the following simple strategy:

1) Calculate the number of 8-connected neighbors for each pixel.

2) Link together the pixels which are 8-connected and have no more than two
neighbor pixels, to form a rough primitive.
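
A literal reading of these two steps can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the image representation (rows of integers) and the function names are assumptions, and the linking step is realized here as a simple 8-connected flood fill over the pixels that have at most two neighbors.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Binary image stored row by row: image[y][x] != 0 means a curve pixel.
using BinaryImage = std::vector<std::vector<int>>;

// Step 1: number of set pixels among the 8 neighbors of (x, y).
int countNeighbors(const BinaryImage& image, int x, int y) {
    const int height = static_cast<int>(image.size());
    const int width = height > 0 ? static_cast<int>(image[0].size()) : 0;
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            const int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height && image[ny][nx] != 0)
                ++count;
        }
    return count;
}

// A set pixel can belong to a rough primitive only if it has at most two neighbors.
bool isPrimitivePixel(const BinaryImage& image, int x, int y) {
    return image[y][x] != 0 && countNeighbors(image, x, y) <= 2;
}

// Step 2: link 8-connected primitive pixels. Returns a label map where 0 means
// background or intersection pixel, and 1..primitiveCount identify rough primitives.
std::vector<std::vector<int>> labelRoughPrimitives(const BinaryImage& image,
                                                   int& primitiveCount) {
    const int height = static_cast<int>(image.size());
    const int width = height > 0 ? static_cast<int>(image[0].size()) : 0;
    std::vector<std::vector<int>> labels(height, std::vector<int>(width, 0));
    primitiveCount = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            if (labels[y][x] != 0 || !isPrimitivePixel(image, x, y)) continue;
            labels[y][x] = ++primitiveCount;
            std::vector<std::pair<int, int>> stack;
            stack.push_back(std::make_pair(x, y));
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = p.first + dx, ny = p.second + dy;
                        if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
                        if (labels[ny][nx] != 0 || !isPrimitivePixel(image, nx, ny)) continue;
                        labels[ny][nx] = primitiveCount;
                        stack.push_back(std::make_pair(nx, ny));
                    }
            }
        }
    return labels;
}
```

Note that, with 8-connectivity, the pixels directly adjacent to an intersection also have more than two neighbors, so under this literal reading they are left unlabeled together with the intersection pixel itself.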

By this process, a curve may be divided into several fragments, which should be
merged to form a meaningful primitive. In the following, a merge-split process
is used to re-arrange these rough primitives into a meaningful form.

2.2 Merge

The objective of merging is to recover the original drawing elements, such as curves
and straight lines, from the rough primitives obtained. In figure 1, where three
curves intersect, 6 rough primitives are obtained from the above decomposition
process; the merging process is expected to recover the original three curves (lines).




Fig. 1. Illustration of merging operation 

It is natural to adopt tangent direction and spatial distance as the basis of the
merging operation. If two rough primitives are collinear and close to each other at
their end points, these two primitives may possibly be merged.
Besides, to get a reasonable result, all the rough primitives near an intersection
must be considered simultaneously when deciding on a merge operation.

Based on the above idea, the following algorithm is designed to determine merging operations.

> Criterion 

For any pair of rough primitives, the following criteria must be satisfied if they are to be
merged:

1) Merging can only be performed at the end points of rough primitives.

2) Exclusivity. Merging is allowed at most once at an end point of a rough primitive.

3) Collinearity and space gap. Two rough primitives can be merged at two of their
end points only if both the tangent direction difference and the spatial distance
between these end points are smaller than a threshold.
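
The third criterion can be written down as a small predicate. The sketch below is only one possible reading: the EndPoint structure and the function names are hypothetical, tangent directions are assumed to be given as angles in radians and treated as undirected (so that two directions differing by pi count as identical), and separate thresholds are assumed for the angle and for the gap.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// End point of a rough primitive: position and tangent direction (radians).
struct EndPoint {
    double x;
    double y;
    double tangent;
};

// Difference between two undirected tangent directions, folded into [0, pi/2].
double directionDifference(double a, double b) {
    const double pi = std::acos(-1.0);
    const double d = std::fmod(std::fabs(a - b), pi);  // in [0, pi)
    return std::min(d, pi - d);
}

// Euclidean distance between two end points (the "space gap").
double gap(const EndPoint& p, const EndPoint& q) {
    return std::hypot(p.x - q.x, p.y - q.y);
}

// Criterion 3: both the direction difference and the gap must be below their thresholds.
bool satisfiesCollinearityAndGap(const EndPoint& p, const EndPoint& q,
                                 double maxAngle, double maxGap) {
    return directionDifference(p.tangent, q.tangent) < maxAngle && gap(p, q) < maxGap;
}
```

Criteria 1 and 2 are bookkeeping conditions (which end points exist, and whether an end point has already been used), so they are left to the caller in this sketch.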

The above criteria rule out most rough primitives when considering possible merges
for one rough primitive. However, several candidates may still remain, from
which the most reasonable one should be determined. For example, in figure 1, segments
$c_4$ to $c_6$ are ruled out from the candidates of $c_1$ by the collinearity criterion; in other
words, they cannot be merged with $c_1$. However, both $c_2$ and $c_3$ may still be
merged with $c_1$. For the exclusivity criterion to be satisfied, the most reasonable
rough primitive must, in this example, be selected from these candidates.
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The merging operation orderly traverses all rough primitives and processes each of 
the two end points of every rough primitive respectively. Assume C; the rough primi- 
tive being processed, the steps of merging operation are as follows: 

(1) Select an end point of $c_i$ and denote it as a.

(2) Exclusivity checking. Check whether $c_i$ has already been merged at end point a; if
so, go to step (7), otherwise go on to the next step.

(3) Obtain the merging candidate set of $c_i$ with the space gap and collinearity rules.
Check the other rough primitives output by the image decomposition part, and
record the rough primitives and their corresponding end points that meet
these rules.

Denote by $c_i^m$, $m = 1, \dots, M$, the candidate set, and by $d_i^m$, $m = 1, \dots, M$,
the tangent direction difference between the corresponding end points of these rough
primitives and end point a of $c_i$.

(4) Calculate the minimum value of $d_i^m$: $d_i = \min \{ d_i^m,\ m = 1, \dots, M \}$.

(5) Set the tolerance value and further narrow the candidate set:
$c_i^n = \{ c_i^m \mid d_i^m < d_i + tolerance \}$, where $c_i^n$ represents the new
candidate set.

(6) If the size of $c_i^n$ equals 1, that is, there is only one element in the set, denote
this element as $c_j$. Regard $c_j$ as the current rough primitive and calculate the
candidate set of $c_j$ at its corresponding end point (recorded in step (3)) with steps
(3) to (5). If the candidate set of $c_j$ contains only rough primitive $c_i$ and the
corresponding end point of $c_i$ is a, merge them together at the corresponding end
point of $c_j$ and end point a of $c_i$.

(7) Denote the other end point of $c_i$ as b, and repeat steps (2)-(6) for b in the same
way as for end point a.

(8) Check whether all rough primitives have been traversed; if not, select the next
rough primitive as the current one and go to step (1); otherwise go to step (9).

(9) Check whether any merge operation occurred during this traversal; if not, finish
the program, otherwise repeat steps (1)-(8) to perform the next traversal.

Unlike for a continuous curve, it is nearly impossible to get the true tangent direction
of a discrete curve. Therefore, errors would result if only a simple minimum direction
difference strategy were adopted. To cope with such cases, a tolerance value is
introduced in the above method so that the true rough primitive is not lost from the
candidate set.
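Steps (3) to (6) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the dictionary layout `ends`, the function names and the three parameters `max_gap`, `max_angle` and `tolerance` are assumptions, tangent directions are treated as undirected angles in radians, and the exclusivity bookkeeping of step (2) is left out.

```python
import math

def angle_diff(t1, t2):
    """Smallest difference between two undirected tangent directions (radians)."""
    d = abs(t1 - t2) % math.pi
    return min(d, math.pi - d)

def candidates(ends, key, max_gap, max_angle, tolerance):
    """Steps (3)-(5): compatible end points, narrowed by the tolerance.

    ends maps (primitive_id, end_label) -> ((x, y), tangent_direction).
    """
    (x, y), t = ends[key]
    found = {}
    for other, ((ox, oy), ot) in ends.items():
        if other[0] == key[0]:              # same primitive: never a candidate
            continue
        d = angle_diff(t, ot)
        if math.hypot(x - ox, y - oy) <= max_gap and d <= max_angle:
            found[other] = d
    if not found:
        return []
    d_min = min(found.values())             # step (4)
    return [k for k, d in found.items() if d < d_min + tolerance]   # step (5)

def mutual_match(ends, key, **kw):
    """Step (6): accept a partner only if each side has the other as sole candidate."""
    cand = candidates(ends, key, **kw)
    if len(cand) == 1 and candidates(ends, cand[0], **kw) == [key]:
        return cand[0]
    return None
```

With a tolerance of zero this reduces to the plain minimum-difference strategy; a positive tolerance keeps near-ties in the candidate set, so an ambiguous end point is left unmerged instead of being merged on the basis of a noisy tangent estimate.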



2.3 Split 

The rough primitives obtained from the above process may contain sharp bends.
Here, a split process is adopted to divide a primitive with sharp bends into several
smooth ones.

Based on the curvature values, the significant bending points in each primitive can
be found, which are then used to split this primitive into several smooth ones.

1) Bending point detection. Let a primitive pixel and its curvature be $c_i$ and
$k_i$; $c_i$ is deemed a bending point if the following conditions are satisfied:
(a) $k_i$ is a local maximum (minimum) in a centered window of width $w$;
(b) $k_i > \alpha \cdot k_{max}$, where $k_{max}$ is the maximum curvature value;
(c) $|k_i| > \beta$, where $\beta$ is a predefined positive threshold.

2) Remove a bending point if it is not obvious. Let $p_1$ and $p_2$ be two consecutive
bending points, and $c_1$, $c_2$, $c_{avg}$ the curvatures corresponding to $p_1$, $p_2$
and the average curvature between these two points respectively. If

$c_{avg} > \delta \cdot (c_1 + c_2) / 2$, where $\delta$ is a predefined threshold,

replace $p_1$ and $p_2$ with their average value as the new bending point.
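The two steps can be illustrated with a short sketch operating on a list of per-pixel curvature values. The function names are hypothetical, and two choices are assumptions rather than part of the method as stated: condition (b) is applied to the absolute curvature so that minima are handled symmetrically, and the "average value" of two bending points is taken as the midpoint of their indices.

```python
def bending_points(k, w, alpha, beta):
    """Step 1: indices whose curvature satisfies conditions (a)-(c)."""
    k_max = max(abs(v) for v in k)
    half = w // 2
    out = []
    for i, v in enumerate(k):
        window = k[max(0, i - half): i + half + 1]
        is_extremum = v == max(window) or v == min(window)      # condition (a)
        if is_extremum and abs(v) > alpha * k_max and abs(v) > beta:  # (b), (c)
            out.append(i)
    return out

def merge_weak_pairs(points, k, delta):
    """Step 2: replace two consecutive bending points by their midpoint when the
    average curvature between them is high compared with their own curvature."""
    out, i = [], 0
    while i < len(points):
        if i + 1 < len(points):
            p1, p2 = points[i], points[i + 1]
            between = [abs(v) for v in k[p1:p2 + 1]]
            c_avg = sum(between) / len(between)
            if c_avg > delta * (abs(k[p1]) + abs(k[p2])) / 2:
                out.append((p1 + p2) // 2)
                i += 2
                continue
        out.append(points[i])
        i += 1
    return out
```

A high average curvature between two detected points suggests one smooth, continuous bend rather than two separate corners, which is why the pair is collapsed into a single split position.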



3 Attributed Graph Construction 

Through the primitive extraction process, an engineering drawing has been decomposed
into basic primitives, such as curves and straight lines. To get a structural
representation of the content of an engineering drawing, these primitives as well as
their relationships must be well described. In this paper, an attributed graph is
adopted for this purpose, where a graph node represents a primitive while the graph
edges indicate the relationships between these primitives.

3.1 Graph Construction 

Delaunay tessellation is a natural way to construct a graph from an engineering
drawing. In the traditional Delaunay method, each primitive (curve, line, etc.) is
represented by a single point, such as its middle point, as the input to graph
construction. However, because of its extent, a long curve (line) in the original image
may be near several other primitives, either at its middle part or at its end parts;
a connection should therefore be built from this primitive to each of these nearby
primitives. If this long primitive is replaced by its middle point alone, only the
connections to the primitives near this middle point survive, so connections are lost.
In addition, some false connections are generated because the information about the
extent of the primitive is lost.

To prevent the loss of primitive extension information, the following method is 
used in graph construction: 

(1) Sample a long primitive evenly into multiple points. 

(2) Adopt these sampled points as input to Delaunay tessellation. 

(3) Graph simplification. The graph nodes sampled from the same primitive are
merged into one node, with their corresponding connections merged together.
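Steps (1) and (3) above can be sketched as follows. The triangulation of step (2) is assumed to come from any Delaunay routine and is passed in as a list of index triples; the function names and the `owner` list are illustrative assumptions, not part of the original method description.

```python
import math
from itertools import combinations

def sample_evenly(polyline, n):
    """Step (1): n points spaced evenly by arc length along a polyline (n >= 2)."""
    seg = [math.dist(p, q) for p, q in zip(polyline, polyline[1:])]
    total, last = sum(seg), len(seg) - 1
    points = []
    for s in range(n):
        target = total * s / (n - 1)
        for idx, length in enumerate(seg):
            if target <= length or idx == last:
                p, q = polyline[idx], polyline[idx + 1]
                t = 0.0 if length == 0 else min(target / length, 1.0)
                points.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
                break
            target -= length
    return points

def primitive_graph(triangles, owner):
    """Step (3): collapse a triangulation over sampled points into primitive-level edges.

    triangles: 3-tuples of point indices (output of a Delaunay routine).
    owner[i]: id of the primitive from which point i was sampled.
    """
    edges = set()
    for tri in triangles:
        for i, j in combinations(tri, 2):
            a, b = owner[i], owner[j]
            if a != b:                          # links inside one primitive vanish
                edges.add((min(a, b), max(a, b)))
    return edges
```

Storing the edges as a set of ordered pairs merges duplicate connections automatically, which corresponds to merging the connections of all points sampled from the same primitive.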

With this sampling strategy, a primitive is represented by multiple points evenly
distributed along the curve (line). Therefore, its relationships with other primitives
that are near this primitive, either at its middle part or at its end parts, are
preserved.

3.2 Attribute Extraction

Graph attributes play the key role in characterizing the content of an engineering
drawing: node attributes depict the appearance of the primitives, such as circular,
straight, or angular, while edge attributes define the spatial relationships between
these primitives, such as parallel, intersecting, and so on.

In this paper, a direction histogram is used to describe the appearance of each
primitive. To achieve rotation invariance, the Fourier transform of this histogram is
computed and the coefficients of this transform are used as the node attribute.
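The reason the Fourier transform helps is that rotating a primitive circularly shifts its direction histogram, and a circular shift changes only the phase of the discrete Fourier coefficients, not their magnitude. The sketch below illustrates this; the bin count, the per-segment counting and the use of coefficient magnitudes are assumptions, since the text does not specify these details.

```python
import cmath
import math

def direction_histogram(points, bins=8):
    """Normalised histogram of segment directions along an ordered point sequence."""
    h = [0.0] * bins
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        ang = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        # round to the nearest bin centre so exact axis directions are stable
        h[int(ang / (2 * math.pi) * bins + 0.5) % bins] += 1
    total = sum(h) or 1.0
    return [v / total for v in h]

def fourier_magnitudes(h):
    """Magnitudes of the DFT of h; unchanged by any circular shift of h."""
    n = len(h)
    return [abs(sum(h[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)))
            for f in range(n)]
```

The invariance is exact only for rotations by a multiple of the bin width; for other angles it holds approximately, with finer bins giving a better approximation.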

Directed edge attributes are proposed to highlight the relationship between two
straight line segments. Let $L_i$ be the line being processed and $L_j$ its neighbor line;
the relationship from $L_i$ to $L_j$ is defined by the following components:

> Relative angle $\alpha$: the acute angle between $L_i$ and $L_j$.

> Relative length $rl$: the length of $L_j$ divided by that of $L_i$.

> Relative position $rp$, which describes the position of the intersection point.

If the intersection point of these two lines is located on $L_j$, as shown in figures 2.1
and 2.2, this attribute is obtained by dividing the smaller of the lengths of line
segments OD and OC by the larger one. Otherwise, it is calculated by dividing the
length of OD by minus the length of OC, as shown in figure 2.3.

> Relative distance $rd$: the length of the line segment connecting the middle points
of these two lines, divided by the length of $L_i$.

> Relative minimum distance $rmd$: the minimum distance between these two lines,
divided by the length of $L_j$.




Fig. 2. Definition of relative position (panels 2.1, 2.2 and 2.3)



These attributes, which are scale and rotation invariant, can precisely define the
spatial relationship between two lines. However, our final target is to get the relative
relation between two arbitrary curves. For this purpose, the curves are first
approximated by straight lines; the attributes between these two straight lines are
then calculated as described above and used as the spatial relationship of the
original curves.
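As a small illustration of why such attributes are scale invariant, the sketch below computes the relative angle, relative length and relative distance for two segments given by their end points. The function name is hypothetical, and normalising by the length of the processed line $L_i$ is an assumption about which line each ratio refers to; the relative position and relative minimum distance are omitted for brevity.

```python
import math

def edge_attributes(li, lj):
    """Relative angle, relative length and relative distance from li to lj.

    Each line is ((x0, y0), (x1, y1)); li is the line being processed.
    """
    (ax, ay), (bx, by) = li
    (cx, cy), (dx, dy) = lj
    len_i = math.hypot(bx - ax, by - ay)
    len_j = math.hypot(dx - cx, dy - cy)
    # acute angle between the two undirected lines, in [0, pi/2]
    diff = abs(math.atan2(by - ay, bx - ax) - math.atan2(dy - cy, dx - cx)) % math.pi
    alpha = min(diff, math.pi - diff)
    mid_i = ((ax + bx) / 2, (ay + by) / 2)
    mid_j = ((cx + dx) / 2, (cy + dy) / 2)
    return {"alpha": alpha,
            "rl": len_j / len_i,
            "rd": math.dist(mid_i, mid_j) / len_i}
```

Because every length is divided by another length measured in the same drawing, multiplying all coordinates by a common factor leaves the three values unchanged.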



4 Graph Matching by Mean Field Theory 

Graph matching is in fact a correspondence determination process [7, 8, 11], where
the nodes of the query graph are seen as a set of labels used to label the nodes of
another graph. Many algorithms have been proposed in the literature to solve this
problem [7-11], such as mean field theory, relaxation labeling and genetic algorithms.
Considering speed and performance, mean field theory is adopted in this paper for
graph matching.

Given two graphs $g = \{v_a, e_{ab}\}$, $a, b = 1, \dots, A$, and $G = \{V_i, E_{ij}\}$,
$i, j = 1, \dots, I$, where $v_a$ and $V_i$ represent the nodes of these two graphs, while
$e_{ab}$ and $E_{ij}$ represent the directed edges of these two graphs; here, $g$ and $G$
are the abstraction graphs corresponding to the query drawing image and a database
drawing image respectively.

Graph matching can be defined as follows: find the match matrix $M$ such that the
following cost function is minimized:

$$E(M) = \sum_{a=1}^{A} \sum_{i=1}^{I} \sum_{b=1}^{A} \sum_{j=1}^{I} M_{ai} M_{bj}\, Dis(e_{ab}, E_{ij}) + \sum_{a=1}^{A} \sum_{i=1}^{I} M_{ai}\, Dis'(v_a, V_i)$$

subject to $\sum_{i=1}^{I} M_{ai} = 1$, where $M_{ai} = 1$ means that node $v_a$ in $g$ is
related to $V_i$ in $G$.

The former part of this cost function represents the cost of matching graph edges, and
the latter part corresponds to the cost of matching graph nodes. $Dis(\cdot)$ represents
the distance between two edges, while $Dis'(\cdot)$ represents the distance between two
nodes.

With the graph node and edge attributes described in section 3, the distance
functions in the above formula are defined as follows:

$$Dis(e_{ab}, E_{ij}) = \begin{cases} cost & \text{if } e_{ab} \text{ or } E_{ij} \text{ is NULL} \\ \text{equally weighted average of the normalized attribute differences} & \text{otherwise} \end{cases}$$

where $\{\alpha_1, rl_1, rp_1, rd_1, rmd_1\}$ and $\{\alpha_2, rl_2, rp_2, rd_2, rmd_2\}$ are the
attributes corresponding to these two edges; see section 3 for details. The above
distance function is in fact a weighted average of attribute components, where the
contribution of each element is normalized to one and equal weights are adopted. It
should be noted that the edges are directed because of the directed attributes defined
in section 3.

$$Dis'(v_a, V_i) = \begin{cases} cost' & \text{if } v_a \text{ or } V_i \text{ is NULL} \\ \text{Euclidean distance of the Fourier coefficients of these two nodes} & \text{otherwise} \end{cases}$$



According to mean field theory, an iterative procedure is obtained to compute the
value of $M$:

$$q_{ai} = -\left( Dis'(v_a, V_i) + 2 \sum_{b=1}^{A} \sum_{j=1}^{I} M_{bj}\, Dis(e_{ab}, E_{ij}) \right)$$

$$M_{ai} = \frac{\exp(q_{ai} / T)}{\sum_{k=1}^{I} \exp(q_{ak} / T)}$$

Fig. 3. Illustration of engineering drawings processing. (a) Query drawing; (b) Primitives
extracted from (a); (c) Constructed graph of (a); (d) A database drawing; (e) Primitives
extracted from (d); (f) Constructed graph of (d)



In matching, the temperature $T$ is decreased towards 0 as the iteration proceeds. $M$
becomes unambiguous when $T \to 0$. The convergence is also controlled by a
quantity defined as:

$$Saturation = \frac{1}{A} \sum_{a} \sum_{i} M_{ai}^{2}$$

also called the 'saturation' of $M$. The iteration is terminated if the saturation value is
larger than a predefined value, such as 0.95.
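The annealing loop can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the update is taken to be a row-wise softmax of the negated cost at temperature $T$, the saturation is taken as the mean over the rows of $M$ of the sum of squared entries, and the uniform initialisation, the geometric cooling schedule and all parameter names are illustrative choices.

```python
import math

def mean_field_match(node_cost, edge_cost, t0=1.0, cooling=0.9,
                     threshold=0.95, max_iter=500):
    """node_cost[a][i] plays the role of Dis'(v_a, V_i);
    edge_cost(a, b, i, j) plays the role of Dis(e_ab, E_ij)."""
    A, I = len(node_cost), len(node_cost[0])
    M = [[1.0 / I] * I for _ in range(A)]        # uniform start: every row sums to 1
    T = t0
    for _ in range(max_iter):
        new_M = []
        for a in range(A):
            q = [-(node_cost[a][i]
                   + 2 * sum(M[b][j] * edge_cost(a, b, i, j)
                             for b in range(A) for j in range(I)))
                 for i in range(I)]
            top = max(q)                           # shift for numerical stability
            e = [math.exp((v - top) / T) for v in q]
            s = sum(e)
            new_M.append([v / s for v in e])       # softmax keeps the row sum at 1
        M = new_M
        saturation = sum(v * v for row in M for v in row) / A
        if saturation > threshold:                 # rows are close to one-hot
            break
        T *= cooling                               # anneal towards T = 0
    return M
```

When every row of $M$ is one-hot the saturation equals 1, and when every row is uniform it equals $1/I$, so a threshold close to 1 stops the iteration once the assignment is nearly unambiguous.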



5 Experiments 

At the beginning of this section, an example is used to demonstrate the whole process
described above; see figure 3. Figures 3(a) and 3(d) are the original query drawing and
the database drawing to be matched. Figures 3(b) and 3(e) show the primitives extracted
from these two drawings, where a filled rectangle is used to denote a primitive. Figures
3(c) and 3(f) illustrate the constructed graphs. In figure 3(f), dark filled rectangles
and thick lines are used to denote the graph nodes and graph edges, respectively, that
are in correspondence with those of the query graph.

In this initial experiment, a database containing 100 engineering drawing images is
set up to evaluate the retrieval performance of the proposed method. These images,
which correspond to different designs of washbowls, are manually classified into
three categories according to their appearance. The numbers of images in the three
categories are 22, 22 and 56 respectively. The images within the first category, and
likewise those within the second, are somewhat similar in that a common part can be
found among them, whereas the images in the third category are distinct from each
other. Figure 4 shows some representative images of each category.




Fig. 4. Example drawings of the database (Fig. 4.1 First category. Fig. 4.2 Second category.
Fig. 4.3 Third category)



In the experiments, we generate a total of 20 query images to evaluate the retrieval
performance, 10 of them similar to the common part of the first category and 10 similar
to that of the second category, and the average precision and recall values are
obtained from these trials. Figure 5 shows an example of a retrieval result, where the
first 20 retrieved images are displayed. In figure 5.2, rectangles are used to denote
the parts of the database drawings that are similar to the query. The average precision
and recall values are illustrated in figure 6.
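For reference, precision and recall over the first k retrieved images can be computed as in the following sketch, where `ranked_labels` is a hypothetical list holding 1 for a retrieved image of the query's category and 0 otherwise, in ranking order. The helper name and argument layout are assumptions made for illustration.

```python
def precision_recall_at_k(ranked_labels, relevant_total, k):
    """Precision and recall over the first k entries of a ranked result list."""
    hits = sum(ranked_labels[:k])
    return hits / k, hits / relevant_total
```

Note that when k equals the number of relevant images in the database (22 for each queried category here), precision and recall coincide, since both are the number of hits divided by the same value.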

Within the first 22 retrieved images (22 being the true size of each queried category),
the average precision is about 56%, and the recall is more than 80% within the first 50
retrieved images. Such retrieval can greatly help designers search for a specific
previous drawing, and thus saves cost when designing a new product. All these
experiments are carried out on a P4-1.3G computer. The average time for one retrieval
is 111.5 seconds; that is, the average time for matching one database engineering
drawing with the query is about 1.1 seconds.

Fig. 5. An example of retrieval result (Fig. 5.1 Query image; Fig. 5.2 Retrieval result, with
the first 20 retrieved images shown)




Fig. 6. Retrieval performance (horizontal axis: image index)



6 Conclusions 

In this paper, we present a practical graph matching based method for engineering
drawing retrieval. With the structural representation, the content as well as the
spatial relationships of the engineering drawing are well described. The reasonable
primitives in an engineering drawing, such as curves, straight lines and ellipses, can
be extracted by the merge-split process proposed in this paper. Thus, engineering
drawing retrieval becomes a problem of comparing structural representations,
which can be implemented by graph matching.

In this paper, it is assumed that all lines and curves in the object image are one
pixel wide on a clean background. However, the technique described in this paper is
not limited to such images. With some pre-processing, such as de-noising and bar
detection, the rough primitives can be obtained from any kind of engineering drawing.
After this, the above method can be used to form reasonable primitives, build
structural descriptions, and perform matching for retrieval.



Abstract. This paper proposes a general architecture to extract knowledge from
graphic documents. The architecture consists of three major components. First, a
set of modules able to extract descriptors that, combined with domain-dependent
knowledge and recognition strategies, allow a given graphical document to be
interpreted. Second, a representation model based on a graph structure that allows
the information of the document to be represented hierarchically at different
abstraction levels. Finally, the third component implements a calligraphic interface
that allows feedback between the user and the system. The second part of the paper
describes an application scenario of the above platform. The scenario is a
system for the interpretation of sketches of architectural plans. This is a tool to
convert sketches to a CAD representation or to edit a given plan by a sketchy
interface. The application scenario combines different symbol recognition algorithms
stated in terms of document descriptors to extract building elements such
as doors, windows, walls and furniture.



1 Introduction 

Graphics Recognition is the subfield of Document Analysis that is concerned with the
interpretation of graphical structures present in documents like plans, maps,
engineering drawings, musical scores, charts, etc. A lot of research has addressed the
problem of converting paper-based graphics documents to electronic formats.
Classically, this conversion involves activities organized in three levels: feature and
low level primitive extraction, primitive (symbol) recognition, and document
understanding using domain knowledge. Techniques from the fields of Image Processing,
Pattern Recognition and Artificial Intelligence underlie the above activities.

A number of high-performance Graphics Recognition systems exist, not only in
academic domains but also as industrial applications. However, most of these systems
are highly domain-dependent. The development of general graphics recognition platforms
still remains a challenge. Some interesting approaches exist that are focused on that
goal [1-3]. In the existing applications the methodological basis is mainly the same.
There is no reason to reinvent the wheel for each application; it is better to combine
and parametrize basic tools in terms of a domain-dependent strategy. On the other hand,
the concept of document is becoming more and more extensive. The classical idea of a
paper document is currently complemented by electronic documents (pdf, html, etc.).
Hence, the idea of converting paper-based documents to electronic format is now
extended to the idea of understanding poorly structured documents from different
sources. This results in new application scenarios such as document browsing, indexing,
editing by means of on-line sketch-based interfaces, etc. However, these new scenarios
continue using the same core of techniques.

The above reasons justify the need for a general platform devoted to graphic document
analysis and recognition. This is the goal of the architecture proposed in this work.
Inspired by the idea of the French DocMining project [1], we propose an architecture
for document mining, i.e. a set of engines to extract different kinds of descriptors
from documents. The combination of such descriptor extractors with the domain-dependent
knowledge defines an application scenario.

On the other hand, the role of the user in the graphics recognition cycle is also a
key issue. User intervention in a graphics recognition process should not be seen
as negative but as natural. A recognition and understanding process may need
feedback from the user to set particular parameters, to validate decisions taken by the
system when there is uncertainty, or just to interact with the system to edit the
document or drive the process. The sketchy interface paradigm is very relevant to the
last issue, i.e. user interaction by means of pen strokes is a powerful tool to draw
new graphic documents, to digitally augment paper documents or to edit documents by
sketchy gestures. Our idea in the system architecture proposed in this paper is to
include a sketchy interface as a component of a graphics recognition system.

Taking into account the above considerations, in this paper we propose a general
architecture for knowledge extraction from graphic documents. This architecture follows
a document mining paradigm and can be summarized in three major issues. First,
it consists of a set of engines to extract knowledge sources or descriptors. A
descriptor is a feature that is likely to acquire a particular meaning in the domain to
which the document belongs. Thus, a feature can be a segment after a vectorization
process, a shape, a colour or texture based segmented region, a perceptually salient
pattern, a text component, etc. The particular interpretation of a given document
depends on the combination of descriptors in terms of domain-dependent rules. The
second issue of our architecture is the definition of a hierarchical metadata model to
represent any graphic document. It consists of an abstract model that, using the set of
descriptor extraction engines, allows a graphic document to be converted into a
normalized document. Finally, the third basis of our architecture is a tool for the
integration of the user in the document understanding cycle. As stated before, this
tool is based on the sketchy interface paradigm and hence allows the user either to
design new diagrams or to edit existing ones by means of gestures.

The above architecture makes sense when it can be instantiated for a given scenario,
i.e. for a particular application. To illustrate it, the second part of this paper
presents an architectural plan analysis system. Architectural sketches have been the
focus of several works on sketch-based interfaces [4,5]. These systems are
specifically designed to analyze and recognize architectural drawings and are based on
the application of specific rules about the domain. Our system aims to follow a general
framework which could be used in other applications. The scenario presented in this
paper is thought of as a tool to assist an architect in the early design phase of a new
project, a stage in which an architect usually converts ideas into sketches. Actually,
the system presented in this paper is part of a more general platform that aims to get
a 3D view of a building from a sketched floor plan. Then the architect can navigate
inside the building using a Virtual Reality environment. Our proposed system combines
the following elements: first, a set of descriptor extraction modules and
domain-dependent knowledge to recognize building elements (graphic symbols) such as
walls, doors, windows, etc. Second, a graph-based structure to represent the documents.
And third, a sketch-based interface paradigm to draw architectural floor plans or to
interactively edit existing ones by adding new elements. Our goal is not to describe a
novel symbol recognition algorithm but to propose a system in which classical graphics
recognition techniques and strategies are combined in a particular scenario consisting
of interpreting architectural sketches.

The remainder of this paper is organized in two parts. First, in section 2 we describe 
the general system architecture. The second part of the paper describes the particular 
use of the architecture elements in the scenario of sketched architectural plans inter- 
pretation. Thus, section 3 presents the feature descriptors, and section 4 describes how 
such descriptors are used in terms of domain-dependent knowledge to recognize build- 
ing elements. Section 5 discusses the experimental evaluation and finally, section 6 is 
devoted to conclusions. 

2 System Architecture and Application Scenario 

The system presented in this paper is a particular scenario of the general platform
introduced in Section 1. This general platform can be organized in three layers: the
first layer is devoted to the acquisition of documents, whatever the acquisition mode,
on-line or off-line. The second layer has to do with the extraction of relevant
features. Finally, the third layer consists of knowledge interpretation, using the
extracted features. These three layers are joined by a common data structure that makes
it possible to represent, store, search and modify information at several levels of
abstraction. The description of that model is presented in Fig. 1.

The data model consists of a generic labelled non-directed graph able to represent 
the information at different abstraction levels by changing the kind of nodes and edges 
forming it. Both nodes and edges can be described by Graphic Objects. A Graphic 
Object is a generic class with some derived classes: Symbol, Point, Line, Arc, Region 
and Stroke. The Symbol class can be described by another graph, or by a Pattern class. 
The Pattern class is a generic class describing a shape that can be specialized into a 
graph, a grammar that describes the pattern by means of grammatical rules or a feature 
vector. The Stroke class is formed by a set of points. 
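The class hierarchy described above could be sketched as follows. The class names follow the text, but the fields, the `connect` helper and the choice of Python dataclasses are illustrative assumptions; the grammar specialization of Pattern and the Arc and Region classes are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional

@dataclass
class GraphicObject:
    """Generic class from which the concrete graphic classes derive."""
    label: str = ""

@dataclass
class Point(GraphicObject):
    x: float = 0.0
    y: float = 0.0

@dataclass
class Line(GraphicObject):
    pass

@dataclass
class Stroke(GraphicObject):
    points: List[Point] = field(default_factory=list)   # a stroke is a set of points

@dataclass
class Pattern:
    """Generic shape description, specialised below."""

@dataclass
class FeatureVector(Pattern):
    values: List[float] = field(default_factory=list)

@dataclass
class Graph(Pattern):
    """Labelled non-directed graph; nodes and edges carry Graphic Objects."""
    nodes: Dict[int, GraphicObject] = field(default_factory=dict)
    edges: Dict[FrozenSet[int], GraphicObject] = field(default_factory=dict)

    def connect(self, a: int, b: int, obj: GraphicObject) -> None:
        self.edges[frozenset((a, b))] = obj      # frozenset: (a, b) equals (b, a)

@dataclass
class Symbol(GraphicObject):
    description: Optional[Pattern] = None        # another graph or any other Pattern
```

Because a Symbol may itself be described by a Graph, the same structure can hold the document at several abstraction levels simply by changing what the nodes and edges stand for.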

The second layer extracts the descriptors explained in Section 3, which are divided
into four types. Vector-based descriptors: mainly lines, arcs and their relationships.
Perceptual grouping descriptors, such as parallelism, collinearity, closed loops and
overlapping. These two types of descriptors need a vectorization [6] pre-process to
obtain the segments and arcs from the input document. Pixel-based descriptors, such as
Zernike moments and zoning. Dynamic descriptors, consisting of the chain of strokes
drawn by a user together with its temporal information and speed.

As explained above, the particular application scenario of this general framework
is the early design of sketches of architectural floor plans. Sketches can be drawn
on-line or off-line. The aim of the platform is to allow architects to design a plan
in a natural way, recognizing its parts and storing it as a normalized document. The
sketching interface paradigm allows the architect to draw a new plan, to edit an
existing one or to interact with the system through a language of sketchy gestures.
From the sketch the system is able to create a structured document consisting of its
building elements. Such building elements are described in terms of domain-dependent
knowledge that mainly consists of prototype patterns and semantic rules, using the
normalized data structure described before. The specialization of the general
architecture to this scenario is graphically outlined in Fig. 2.

In the first layer, the document can be obtained off-line by scanning a sketch
hand-drawn in the "classical way" on paper, or on-line with a digital tablet, a tablet
PC or a digital pen. Both ways cooperate in the creation process of the document. A
first version of a sketch can be created either off-line or on-line. Then the system
can recognize its parts and store it in a database, so that it can later be retrieved
and modified on-line. When the document is edited with a sketchy interface, the
recognition process acts on the new input only and not on the whole already processed
document.

Once the document is obtained, its structure is computed in the second layer of the
architecture. The structure of this kind of document is formed by the structural parts
of the building (walls, windows and doors), the furniture (tables, sofas, chairs, etc.)
and the facilities (plugs, TV connections, pipes, etc.). Each kind of component can be
identified by means of a set of characteristics that in some cases are common to more
than one kind of symbol.

Fig. 2. The particular architecture for an architectural plan analysis platform.



Finally, the third layer extracts the knowledge from the descriptors and saves it in
the normalized format. Three kinds of recognition approaches are used. Rule-based
recognition uses the information of parallelism to recognize windows and walls. String
matching recognizes doors and some furniture symbols from the vectorial and dynamic
information. Statistical classification recognizes some furniture symbols from
pixel-based descriptors. All these approaches are explained in more detail in Section 4.

The general data model described previously is used to get a description of the plan
at several levels of abstraction in the following way. The first level of abstraction
is represented by a graph describing the layout of the sketch. The edges represent the
segments and arcs appearing in the document and the nodes represent the points
connecting them. At a second level of abstraction, the nodes of the graph represent the
closed regions in the document, and the edges their relationships. At a higher
abstraction level, the nodes represent rooms with their associated class of room (a
kitchen, a bathroom, etc.), and the edges are the relations among them.



3 Feature Extraction Engines 

An architectural drawing is composed of different types of elements: walls, doors,
windows, stairs, symbols describing furniture elements, etc. There is no single
representation scheme or recognition method able to describe and identify all of them.
Therefore, according to the general system architecture described in section 2, several
descriptors are extracted from the drawing in order to get an optimal representation of
every kind of element. These descriptors are combined to get a global representation of
the drawing, using the common data structure explained in section 2. We have grouped
those descriptors into four categories: vector-based descriptors, perceptual grouping
descriptors, pixel-based descriptors and dynamic descriptors.

3.1 Vector-Based Descriptors

Vectors always play an important role in the description of technical drawings. There are 
several approaches to vectorization [6]. In this case, we apply a vectorization process 
based on thinning and polygonal approximation of the image skeleton. The result of 
vectorization is a set of vectors (lines or arcs) and their relationships. They are organized 
using the common graph structure described in section 2. 

Vectorial representation is used for the description and recognition of those elements 
describing the structure of the drawing, such as walls, doors and windows. It is also the 
basis for obtaining perceptual grouping descriptors. 

3.2 Perceptual Grouping Descriptors 

Segmentation is a very important task in any graphics recognition system. The elements 
of the drawing must be located before being recognized. Usually, segmentation follows 
specific rules for each kind of drawing as the type of elements and their relationship 
are different from one kind of drawing to another. Sometimes, segmentation is carried 
out along with recognition, locating and recognizing the elements of the drawing at the 
same time. 

Another approach is to detect possible locations of elements in the drawing, and
then verify the existence of the element in that position with some recognition method.
The detection of such possible locations is usually based on perceptual grouping
techniques where salient features are detected. These features are necessary, but not
sufficient, to determine the existence of a given element. Therefore, they can be used
as seeds to search for one specific element.

We have used this last approach and, therefore, we have defined a set of descriptors
which will be employed later in segmentation tasks, such as the detection of walls,
windows and symbols. These descriptors are based on the vector-based descriptors
obtained as the result of vectorization:

- Parallelism: detected when the difference in orientation between two nearby lines
is near 0 or 180 degrees, given some preselected threshold.

- Overlapping: the projections, either on the horizontal or the vertical axis, of two
lines must overlap, and the ratio between the length of the overlap and the length of
the shortest line must be larger than a given threshold.

- Collinearity: this condition is detected when a parallelism between two
non-overlapping lines is found, and the difference in orientation between the line
joining the farther end points and both lines is near 0 or 180 degrees.

- Closed loops: they are detected through the analysis of the vector graph. The loops
are represented as a graph of adjacent regions, using the common data structure.
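The parallelism and overlapping tests can be sketched as below. The function names, the default angular tolerance and the use of the shorter projection as the denominator of the overlap ratio are assumptions made for illustration; the proximity test implied by "nearby lines" is omitted.

```python
import math

def orientation(seg):
    """Orientation of an undirected segment in degrees, in [0, 180)."""
    (x0, y0), (x1, y1) = seg
    return math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0

def is_parallel(s1, s2, angle_tol=10.0):
    """True when the orientations differ by nearly 0 or 180 degrees."""
    d = abs(orientation(s1) - orientation(s2))
    return min(d, 180.0 - d) <= angle_tol

def overlap_ratio(s1, s2, axis=0):
    """Overlap of the projections on one axis (0 = horizontal, 1 = vertical),
    divided by the length of the shorter projection."""
    a0, a1 = sorted(p[axis] for p in s1)
    b0, b1 = sorted(p[axis] for p in s2)
    shortest = min(a1 - a0, b1 - b0)
    if shortest == 0:
        return 0.0
    return max(0.0, min(a1, b1) - max(a0, b0)) / shortest
```

Two segments would then be reported as overlapping when `overlap_ratio` exceeds the chosen threshold on at least one axis, and as collinearity candidates when they are parallel but the ratio stays below it.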

3.3 Pixel-Based Descriptors 

Some of the symbols which appear in architectural drawings are quite small. As we
are working with hand-drawn symbols, they cannot be represented by vectors, because the
variability of hand-drawing introduces too much distortion in the vectorization
results. Therefore, these symbols must be described using descriptors computed from
the binary image and not from their vector representation. We have used two different
kinds of pixel-based descriptors: circular zoning and Zernike moments. Both methods
rely on a previous segmentation of the symbol, which is carried out by analysis of
connected components in the drawing.

Circular zoning is based on a method developed by Adam [7], and it yields a rotation-
and scale-invariant feature vector. Once the symbol is segmented, the image
is divided into concentric zones as illustrated in figure 3. For each zone, the number 
of black pixels is counted and normalized according to the area of the zone. Thus, the 
feature vector contains one value for each zone, representing the ratio of black pixels 
in it. 
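
A minimal sketch of such a zoning descriptor is given here. It is not the method of [7]
itself: the choice of the centroid as the centre, the normalization of radii by the
farthest black pixel, and the name circular_zoning are all assumptions made for
illustration.

```python
import math

def circular_zoning(image, n_zones=4):
    """Ratio of black pixels in each concentric ring around the symbol centroid.

    image is a list of rows of 0/1 values (1 = black pixel).
    """
    black = [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]
    if not black:
        return [0.0] * n_zones
    cy = sum(r for r, _ in black) / len(black)
    cx = sum(c for _, c in black) / len(black)
    # Scale invariance (assumed scheme): radii are relative to the farthest black pixel.
    r_max = max(math.hypot(r - cy, c - cx) for r, c in black) or 1.0
    black_count = [0] * n_zones
    area = [0] * n_zones
    for r, row in enumerate(image):
        for c, v in enumerate(row):
            rho = math.hypot(r - cy, c - cx) / r_max
            if rho > 1.0:
                continue  # pixel lies outside the outermost ring
            zone = min(int(rho * n_zones), n_zones - 1)
            area[zone] += 1
            black_count[zone] += v
    # Normalize each count by the area (in pixels) of its zone.
    return [b / a if a else 0.0 for b, a in zip(black_count, area)]
```

Because only the distance to the centre is used, rotating the symbol leaves the vector
unchanged up to discretization effects.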




Circular zoning is not able to distinguish between all symbols, given the large
variability of hand-drawn symbols. Therefore, we have introduced another set of
invariant features based on Zernike moments in order to improve recognition
accuracy. Zernike moments [8] have been successfully used in the context of OCR.
They are related to the usual geometric moments, but they reduce the redundancy
in the information conveyed by geometric moments. Moreover, they can easily be
used to define a rotation-invariant feature vector. They are based on the projection
of the image, mapped onto the unit circle, over the Zernike polynomials. We have
taken Zernike moments from order 2 up to 6 (fourteen values) to
build the feature vector.
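
The count of fourteen values can be checked with a short sketch. It assumes the usual
indexing of Zernike moments by an order n and a repetition m with 0 <= m <= n and n - m
even; the function names are hypothetical, and only the radial part of the polynomials
is shown.

```python
from math import factorial

def zernike_indices(min_order=2, max_order=6):
    """All (n, m) with min_order <= n <= max_order, 0 <= m <= n and n - m even."""
    return [(n, m)
            for n in range(min_order, max_order + 1)
            for m in range(n % 2, n + 1, 2)]

def radial_polynomial(n, m, rho):
    """Zernike radial polynomial R_n^m(rho) for 0 <= rho <= 1."""
    total = 0.0
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s)
                 / (factorial(s)
                    * factorial((n + m) // 2 - s)
                    * factorial((n - m) // 2 - s)))
        total += coeff * rho ** (n - 2 * s)
    return total
```

Under this indexing, orders 2 to 6 contribute 2 + 2 + 3 + 3 + 4 = 14 moments, and the
magnitude of each moment is the rotation-invariant quantity.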



3.4 Dynamic Descriptors 

As we have explained, the system can take both off-line and on-line input. For on-line 
input, we can take advantage of the time information to improve recognition accuracy. 
With that aim, we have also defined a set of dynamic descriptors, obtained from the 
on-line input, organized at two levels of abstraction. 

The first level of descriptors is simply the sequence of points at each moment of
time. This sequence of points is described with the usual chaincode representation. The
second level of descriptors aims to extract the structure of the on-line sequence. To
this end, each stroke (sequence of points) is divided into segments, each segment
corresponding to a straight line or an arc in the drawing. This segmentation is achieved
by locating the breaking points where a corner can be found, looking for changes in
local and global stroke curvature (see figure 4).
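
As an illustration of the first level, the following sketch computes an 8-direction
chaincode for a stroke. It assumes a Freeman-style code with direction 0 along the
positive x axis and codes increasing counterclockwise; the paper does not state which
variant is used, and the function name is hypothetical.

```python
import math

def chaincode(points):
    """8-direction chain code of a stroke given as a list of (x, y) points."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) == (x1, y1):
            continue  # repeated samples carry no direction
        angle = math.atan2(y1 - y0, x1 - x0)            # radians in (-pi, pi]
        codes.append(round(angle / (math.pi / 4)) % 8)  # quantize to 45-degree steps
    return codes
```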





Fig. 4. Division of strokes into segments: (a) original image, (b) segmentation.



4 Knowledge Extraction 

Once descriptors have been extracted from the drawing, different processes must be
activated to identify and recognize all the elements in it. As we have used a different
set of descriptors for each kind of element, we must also use different recognition
methods to locate and identify them. We can group these methods into three kinds of
pattern recognition approaches: rule-based recognition, string matching and statistical
classification.

4.1 Rule-Based Recognition 

This approach is used for the detection of the structure of the building: walls and win- 
dows. It relies basically on the perceptual grouping descriptors. Starting from the paral- 
lelism and overlapping descriptors, it consists in applying a set of rules to determine if 
two or three parallel and overlapping vectors can be considered as a wall or a window, 
respectively (see figure 5). Collinearity is used to join different lines belonging to the 
same wall. 

The rules applied are based on the distance between parallel and collinear lines,
the degree of overlapping between them, the aspect ratio of the rectangle enclosing
the wall or the window, and the length of the segments composing them.



Fig. 5. Detection of walls and windows with three or more lines: (a) window, (b) wall.

4.2 String Matching

String matching [9] is used for the recognition of two different kinds of elements in
the drawings: doors, and furniture symbols in on-line drawings. The recognition of doors
is based on the previously detected loops, represented as a string of vectors. String
matching then computes an edit distance between the loop found in the image and the
loop representing the model of a door. The final distance value determines whether
both loops are similar.

For symbols in on-line drawings, string matching is applied after splitting each
stroke into a set of segments and representing each segment with its chaincode. String
matching defines a distance measure between each segment in the symbol image and each
segment in the model of the symbol. This distance is computed by weighting the final
edit cost resulting from the string matching with other measures, such as the distance
between the end points of both segments and the difference in length between both
segments. Once we have computed the distance between all pairs of segments, we can
find the association between image segments and model segments that yields the minimal
global distance. This global distance is used as the classification measure.
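
The edit cost at the core of this matching can be sketched with a standard Levenshtein
distance over chaincode sequences. Unit costs for insertion, deletion and substitution
are an assumption; the cost model and the weighting with end-point and length measures
described above are not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit costs (assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match at cost 0
        prev = curr
    return prev[-1]
```

A segment pair with a small edit distance between its chaincodes would then be a good
candidate for association between the image and the model.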



4.3 Statistical Classification 

The recognition of furniture symbols in off-line drawings relies on pixel-based
descriptors, namely circular zoning and Zernike invariant moments. As images are
represented as feature vectors, we must use some statistical method to classify them.
We have used the Mahalanobis distance as the classification criterion, as it takes
into account the variability in each class, learned from a set of symbol samples.
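
The following sketch shows how such a classifier might look. To stay self-contained it
assumes a diagonal covariance matrix per class (the full Mahalanobis distance uses the
complete covariance matrix), and the names fit_class, mahalanobis and classify as well
as the regularization constant eps are hypothetical.

```python
import math

def fit_class(samples, eps=1e-6):
    """Per-feature mean and variance of one class (diagonal covariance assumption)."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    # eps keeps the variance strictly positive for constant features.
    var = [sum((s[d] - mean[d]) ** 2 for s in samples) / n + eps for d in range(dim)]
    return mean, var

def mahalanobis(x, model):
    """Distance from x to a class model, scaling each feature by its variance."""
    mean, var = model
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))

def classify(x, models):
    """Label of the class whose model is closest to x."""
    return min(models, key=lambda label: mahalanobis(x, models[label]))
```

A class with a large spread along one feature tolerates larger deviations along that
feature, which is the behaviour the text attributes to this distance.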



5 Experimental Evaluation 

This section illustrates the performance of the application scenario of architectural
sketch understanding. As described above, the system is a tool for the early design
stages of architectural projects. The system converts sketches of floor plans to a CAD
representation consisting of building elements. We can distinguish three major
categories of elements: structure (walls, doors and windows), furniture (chairs, beds,
tables, etc.), and facilities (electrical, illumination, etc.). Since each of the above
elements has its own diagrammatic representation, the sketch understanding problem
is actually a symbol recognition problem. As noted above, a key component of our
architecture is a sketchy interface. The system can work either on-line or off-line,
and hence the symbol recognition algorithms take into account this twofold input
procedure. Conceptually, we use a digital pen and paper paradigm, i.e. the system
allows the user to input a scanned sketch, to interactively draw a new one or edit an
existing one using a TabletPC, or even to use paper as the input medium, but with
augmented functionality [10]. Figure 6 illustrates the sketch understanding process.
First, the initial paper sketch of Fig. 6(a) has been scanned. The different feature
extraction procedures and symbol recognition strategies described in sections 3 and 4
have been applied. Figure 6(b) shows the reprinted document once the different
symbols have been recognized. Afterwards, the user has edited the plan by means of a
sketchy interface on a TabletPC, adding some elements (see Fig. 6(c)). In that case,
dynamic descriptors have been extracted and on-line symbol recognition methods have
been applied to get the final result of Fig. 6(d).

Let us now analyze the performance of the system quantitatively. This analysis is
formulated in terms of the performance of the different symbol recognition methods
that are applied to recognize the graphical elements. To do so, we have used the
following ground truth. We have used a set of 25 classes of architectural symbols.
Ten different people were asked to sketch at least ten instances of each symbol. We
thus collected a database of 4200 symbol images with different distortion levels.
Symbols were drawn using a Logitech io pen device [10], which provides both an on-line
and an off-line version of each sample. The symbol recognition algorithms described in
sections 3 and 4 have been applied to the samples in the database. The off-line
process, formulated in terms of statistical classifiers, had an overall recognition
rate of 69.3%. This rate was very sensitive to the symbol instances used to learn the
pattern models. The learning set was constructed by taking one symbol instance for
each person and symbol class. This results in a very high intra-class variability and
hence a high inter-class confusion rate. This can be observed if we reduce the test set
to the symbols drawn by only five people: in that case, the recognition rate ranges
from 76% to 81%, depending on the five people selected. In addition, when a symbol is
misclassified, we have considered the second option given by the classifier. If it is
taken into account, the recognition rate reaches 84%. This suggests a system in which
the models are personalized to a given user. For the on-line symbol recognition
method, we obtained an overall recognition rate of 99.08%, ranging from 95% for the
symbol showing the worst performance to 100% for the symbols showing the best
performance. For some of the symbols, we have defined two models in order to adjust
the method to different drawing styles. The on-line system is able to recognize 200
images per second. These results are still very preliminary, and they can be improved
by introducing some variations in the classifiers.



Fig. 6. Sketch recognition process: (a) initial sketch, (b) recognized entities, (c) added
entities using a sketchy interface, (d) final result.

6 Conclusions

A number of high-performance graphics recognition systems exist. Most of them use
ad-hoc methods and are very domain-dependent. However, if we further analyze the
existing systems, we can notice that the methodological basis is almost the same, and
that from one system to another the differences lie in the tuning parameters and the
domain-dependent knowledge. In this paper we have proposed a general graphics
recognition architecture. The architecture combines a set of feature extraction modules
that, combined in terms of domain-dependent knowledge, allow document entities to be
recognized in a given application scenario. A second important component of our
architecture is the definition of a relational metamodel that allows a document to be
represented hierarchically at different abstraction levels (from features to entities).
Finally, our architecture also addresses HCI by defining a sketch-based interface that
allows not only creating technical drawings but also editing existing ones, by adding
new elements or interpreting gestures as edit commands. The second part of the paper
has been devoted to describing an application scenario, in particular a system to
convert architectural sketches, either off-line or on-line, to a CAD representation.
The set of descriptors used by the symbol recognition methods has been briefly
described, and finally the performance of these methods has been analyzed. For this
last part, a ground-truth database of more than 4000 hand-drawn symbols has been used.
The proposed architecture is still at an early stage. In addition to improving the
symbol recognition methods to increase the recognition rates, we are now working on
the inclusion of other descriptor extraction methods and on the design and development
of a prototyping framework, i.e. a way to combine descriptor extraction modules with
domain-dependent knowledge to generate application scenarios.

Abstract. In this paper we describe methods of performing data mining on web
documents, where the web document content is represented by graphs. We show how
traditional clustering and classification methods, which usually operate on vector
representations of data, can be extended to work with graph-based data. Specifically,
we give graph-theoretic extensions of the k-Nearest Neighbors classification algorithm
and the k-means clustering algorithm that process graphs, and show how the retention
of structural information can lead to improved performance over the vector model
approach. We introduce several different types of web document representations that
utilize graphs and compare their performance for clustering and classification.



1 Introduction 

Web document mining [1] is the application of data mining techniques to web- 
related documents. Web mining methodologies can generally be classified into 
one of three categories: web usage mining, web structure mining, and web con- 
tent mining [2]. In web usage mining the goal is to examine web page usage 
patterns in order to learn about a web system’s users or the relationships be- 
tween the documents. Web usage mining is useful for providing personalized web 
services, an active area of web mining research. In the second category of web 
mining methodologies, web structure mining, only the relationships between web 
documents are examined, usually by utilizing the information conveyed by each 
document’s hyperlinks. Like web usage mining, the actual content of the web 
pages is often ignored. 

In the current paper we are concerned with the third category of web mining, 
web content mining. In web content mining we examine the actual content of 
web pages (most often the text contained in the pages) and then perform some 
web mining procedure, most typically clustering or classification. Content-based 
classification of web documents is useful because it allows users to more easily 
navigate and browse collections of documents [3] [4]. Such classifications are often
costly to perform manually, as they require a human expert to examine the content of
each web document. Due to the large number of documents available on the Internet in
general, or even when we consider smaller collections of
web documents, such as those associated with corporate or university web sites,
an automated system which performs web document classification is desirable 
in order to reduce costs and increase the speed with which new documents are 
classified. Clustering is an unsupervised method which attempts to separate web 
documents into similar groups, while classification is a supervised learning tech- 
nique which aims to assign a specific label to each web document. Clustering is 
done to organize web documents into related groups. This has benefits when the 
classes are not known a priori, such as in web search engines [5], since it allows 
systems to display results grouped by cluster (topic), in comparison to the usual 
“endless” ranked list, making browsing easier for the user. Using classification 
techniques with these types of systems is difficult due to the highly dynamic na- 
ture of the Internet; creating and maintaining a training set would be challenging 
and costly. For this reason, it is necessary to create clusters automatically from 
the data. 

Traditional information retrieval and data mining methods represent docu- 
ments with a vector model, which utilizes a series of numeric values associated 
with each document. Each value is associated with a specific term (word) that 
may appear on a document, and the set of possible terms is shared across all 
documents. The values may be binary, indicating the presence or absence of the 
corresponding term. The values may also be non-negative integers, which repre- 
sent the number of times a term appears on a document (i.e. term frequency). 
Non-negative real numbers can also be used, in this case indicating the impor- 
tance or weight of each term. These values are derived through a method such as 
the popular inverse document frequency model [6], which reduces the importance
of terms that appear on many documents. Regardless of the method used, each 
series of values represents a document and corresponds to a point (i.e. vector) in 
a Euclidean feature space; this is called the vector-space model of information 
retrieval. This model is often used when applying data mining techniques to doc- 
uments, as there is a strong mathematical foundation for performing distance 
measure and centroid calculations using vectors. However, this method of docu- 
ment representation does not capture important structural information, such as 
the order and proximity of term occurrence, or the location of term occurrence 
within the document. 

In order to overcome this problem we have introduced several methods of 
representing web document content using graphs instead of vectors, and have 
extended existing data mining methods to work with these graphs. These ap- 
proaches have two main benefits: 1. they allow us to keep the inherent structural 
information of the original document without having to discard information as 
we do with the vector model representation, and 2. they intuitively extend ex- 
isting, well-known data mining algorithms rather than create new algorithms, 
whose properties and behavior are unknown. 

Only recently have a few papers appeared in the literature that deal with 
graph representations of documents. Lopresti and Wilfong compare web doc- 
uments using a graph representation that primarily utilizes HTML parse in- 
formation, in addition to hyperlink and content order information [7]. In their 
approach they use graph probing, which extracts numerical feature information
from the graphs, such as node degrees or edge label frequencies, rather than 
comparing the graphs themselves. In contrast, our representation uses graphs 
created solely from the content, and we use the graphs themselves rather than a 
set of extracted features. Liang and Doermann represent the physical layout of 
document images as graphs [8]. In their layout graphs, nodes represent elements
on the page of a document, such as columns of text or headings, while edges indi- 
cate how these elements appear together on the page (i.e. spatial relationships). 
This method is based on the formatting and appearance of the documents when 
rendered, not the textual content (words) of a document as in our approach. 

Earlier work by the authors dealing with graph-based classification [9] [10] and 
clustering [11] [12] of web documents has been presented. In this paper we aim to 
give a synopsis of our previous work and a general framework for web document 
mining using a graph model of web document content. We also introduce a new 
graph representation model and compare it to the ones previously proposed. The 
paper is organized as follows. In Sect. 2 we will describe several techniques for 
representing web document content by graphs. We demonstrate how the popular 
k-Nearest Neighbors classification method can be easily extended to work with
these graphs in Sect. 3. Similarly, in Sect. 4, we explain the graph-theoretic
extension of the k-means clustering algorithm. Experimental results for both
methods comparing the graph and vector approaches are presented in Sect. 5. 
Finally, some concluding remarks are provided in Sect. 6. 



2 Graph Representations of Web Documents 

In this section we describe methods for representing web documents using graphs 
instead of the traditional vector representations. All representations are based 
on the adjacency of terms in a web document. These representations are named: 
standard, simple, n-distance, n-simple distance, raw frequency and normalized 
frequency. 

Under the standard method each unique term (word) appearing in the doc- 
ument, except for stop words such as “the”, “of”, and “and” which convey little 
information, becomes a node in the graph representing that document. Each 
node is labeled with the term it represents. Note that we create only a single 
node for each word even if a word appears more than once in the text. Second, 
if word a immediately precedes word b somewhere in a “section” s of the docu- 
ment, then there is a directed edge from the node corresponding to term a to the 
node corresponding to term b with an edge label s. We take into account certain 
punctuation (such as periods) and do not create an edge when these are present 
between two words. Sections we have defined for the standard representation are: 
title, which contains the text related to the document’s title and any provided 
keywords (meta-data); link, which is text that appears in hyper-links on the doc- 
ument; and text, which comprises any of the visible text in the document (this 
includes text in links, but not text in the document’s title and keywords). Next 
we remove the most infrequently occurring words on each document, leaving at 
most m nodes per graph (m being a user-provided parameter). This is similar
to the dimensionality reduction process for vector representations [6]. Finally we 
perform a simple stemming method and conflate terms to the most frequently 
occurring form by re-labeling nodes and updating edges as needed. An example 
of this type of graph representation is given in Fig. 1. The ovals indicate nodes 
and their corresponding term labels. The edges are labeled according to title 
(TI), link (L), or text (TX). The document represented by the example has the 
title “YAHOO NEWS”, a link whose text reads “MORE NEWS”, and text containing
“REUTERS NEWS SERVICE REPORTS”. Note there is no restriction
on the form of the graph and that cycles are allowed. While this approach to 
document representation appears superficially similar to the bigram, trigram, 
or N-gram methods, those are statistically-oriented approaches based on word 
occurrence probability models [13]. The methods presented here, with the ex- 
ception of the frequency representations described below, do not require or use 
the computation of term probability relationships. 
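
A minimal sketch of the construction of the standard representation follows. Several
details are assumptions made for illustration: the stop-word list is a tiny subset,
stop words are dropped before adjacency is computed, only sentence-ending punctuation
breaks adjacency, term frequency is counted over all sections together, and the
stemming step is omitted. The name standard_graph and its parameters are hypothetical.

```python
STOP_WORDS = {"the", "of", "and"}  # illustrative subset only

def standard_graph(sections, max_nodes=None):
    """Build (nodes, edges) from a mapping {section label: text}.

    edges is a set of (term_a, term_b, section label) triples.
    """
    counts = {}
    edges = set()
    for label, text in sections.items():
        # Sentence-ending punctuation breaks adjacency, so split on it first.
        for chunk in text.replace("!", ".").replace("?", ".").split("."):
            terms = [w for w in chunk.lower().split() if w not in STOP_WORDS]
            for term in terms:
                counts[term] = counts.get(term, 0) + 1
            edges.update((a, b, label) for a, b in zip(terms, terms[1:]))
    # Keep the most frequent terms (ties broken alphabetically), at most max_nodes.
    nodes = sorted(counts, key=lambda t: (-counts[t], t))
    if max_nodes is not None:
        nodes = nodes[:max_nodes]
    keep = set(nodes)
    edges = {(a, b, s) for a, b, s in edges if a in keep and b in keep}
    return keep, edges
```

Calling it with {"TI": "YAHOO NEWS", "L": "MORE NEWS", "TX": "REUTERS NEWS SERVICE
REPORTS"} yields a single node for the term "news" that is shared by edges from all
three sections.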




Fig. 1. Example of a standard graph representation of a document. 



The second type of graph representation we will look at is what we call the 
simple representation. It is basically the same as the standard representation, 
except that we look at only the visible text on the page, and do not include 
title and meta-data information (the title section). Further, we do not label the 
edges between nodes so there is no distinction between link and text sections. 
An example of this type of representation is given in Fig. 2. 




Fig. 2. Example of a simple graph representation of a document. 



The third type of representation is called the n-distance representation. Un- 
der this model, there is a user-provided parameter, n. Instead of considering only 
terms immediately following a given term in a web document, we look up to n 
terms ahead and connect the succeeding terms with an edge that is labeled with
the distance between them (unless the words are separated by certain punctu- 
ation marks). For example, if we had the following text on a web page, “AAA 
BBB CCC DDD”, then we would have an edge from term AAA to term BBB 
labeled with a 1, an edge from term AAA to term CCC labeled 2, and so on. 
The complete graph for this example is shown in Fig. 3. 
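
The edge construction for this example can be sketched as follows. The function name is
hypothetical, the input is assumed to be an already tokenized list of terms, and the
punctuation rule mentioned above is left out for brevity.

```python
def n_distance_edges(terms, n):
    """Edges (term_a, term_b, distance) for every ordered pair at most n terms apart."""
    edges = set()
    for i, a in enumerate(terms):
        for d in range(1, n + 1):
            if i + d < len(terms):
                edges.add((a, terms[i + d], d))
    return edges
```

Dropping the third component of each triple gives the edge set of an unlabeled variant
in which only the existence of the connection is recorded.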




Fig. 3. Example of an n-distance graph representation. 



Similar to n-distance, we also have the fourth graph representation, n-simple 
distance. This is identical to n-distance, but the edges are not labeled, which 
means we only know that the distance between two connected terms is not more 
than n. 

The fifth graph representation is what we call the raw frequency represen- 
tation. This is similar to the simple representation (adjacent words, no section- 
related information) but each node and edge is labeled with an additional fre- 
quency measure. For nodes this indicates how many times the associated term 
appeared in the web document; for edges, this indicates the number of times the 
two connected terms appeared adjacent to each other in the specified order. The 
raw frequency representation uses the total number of term occurrences (on the 
nodes) and co-occurrences (edges). 

A problem with this representation is that large differences in document size 
could lead to skewed comparisons, similar to the problem encountered when 
using Euclidean distance with vector representations of documents. Under the 
normalized frequency representation, instead of associating each node with the 
total number of times the corresponding term appears in the document, a nor- 
malized value in [0, 1] is assigned by dividing each node frequency value by the 
maximum node frequency value that occurs in the graph; a similar procedure is 
performed for the edges. Thus each node and edge has a value in [0, 1] associ- 
ated with it, which indicates the normalized frequency of the term (for nodes) 
or co-occurrence of terms (for edges). 
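
The two frequency representations can be sketched as follows, assuming the document has
already been reduced to a list of terms in reading order. The function names are
hypothetical.

```python
from collections import Counter

def raw_frequency_graph(terms):
    """Node and edge frequencies for adjacent terms, in the order they appear."""
    node_freq = Counter(terms)
    edge_freq = Counter(zip(terms, terms[1:]))
    return node_freq, edge_freq

def normalize(freq):
    """Divide every frequency by the maximum one, giving values in [0, 1]."""
    top = max(freq.values(), default=0)
    return {key: value / top for key, value in freq.items()} if top else {}
```

For the term sequence "news service news service reports", the edge from "news" to
"service" occurs twice and receives the normalized value 1.0, while the other two edges
receive 0.5.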

3 k-Nearest Neighbors Classification with Graphs

In this section we describe the k-Nearest Neighbors (k-NN) classification algorithm
and how we can easily extend it to work with the graph-based representations of web
documents described above. The basic k-NN algorithm is given as






Adam Schenker et al. 



follows [14]. First, we have a set of training examples; in the traditional k-NN
approach these are numerical feature vectors. Each of these training instances is
associated with a label which indicates to what class the instance belongs. Given
a new, previously unseen instance, called an input instance, we attempt to estimate
which class it belongs to. Under the k-NN method this is accomplished by
looking at the k training instances closest (i.e. with least distance) to the input
instance. Here k is a user-provided parameter and distance is usually defined to
be the Euclidean distance:

$$\mathrm{dist}_{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

where $x_i$ and $y_i$ are the $i$th components of vectors $x = [x_1, x_2, \ldots, x_n]$
and $y = [y_1, y_2, \ldots, y_n]$, respectively. However, for applications in document
classification,
the cosine or Jaccard similarity measures [6] are often used due to their length 
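invariance property. We can convert these to a distance measure by the following:

$$\mathrm{dist}_{COS}(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|} \qquad (2)$$

$$\mathrm{dist}_{JAC}(x, y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i y_i} \qquad (3)$$

In Eq. (2), $\cdot$ indicates the dot product operation and $\| \cdots \|$ indicates the
magnitude (length) of a vector.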
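
For reference, the three vector distances discussed above can be sketched in a few
lines. The Jaccard variant shown is the extended (Tanimoto) form for real-valued
vectors, which is an assumption about the exact formula intended; the function names
are hypothetical, and zero vectors are not handled.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """One minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def jaccard_distance(x, y):
    """One minus the extended Jaccard (Tanimoto) similarity, an assumed variant."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)
```

Doubling every term count of a document changes its Euclidean distance to other
documents but leaves the cosine distance unchanged, which illustrates the length
invariance mentioned above.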

Once we have found the k nearest training instances using some distance 
measure, such as one of those defined above in Eqs. (1-3), we estimate the 
class by the majority among the k instances. This class is then assigned as the 
predicted class for the input instance. If there are ties due to more than one 
class having equal numbers of representatives amongst the nearest neighbors 
we can either choose one class randomly or we can break the tie with some 
other method, such as selecting the tied class which has the minimum distance 
neighbor. For the experiments in this paper we will use the latter method, which 
in our experiments has shown a slight improvement over random tie breaking. 
In order to extend the k-NN method to work with graphs instead of vectors, we
only need a distance measure which computes the distance between two graphs 
instead of two vectors. This graph-theoretic distance measure, which utilizes the 
concept of the maximum common subgraph, is [15]: 



$$\mathrm{dist}_{MCS}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{\max(|G_1|, |G_2|)} \qquad (4)$$

where $G_1$ and $G_2$ are graphs, $mcs(G_1, G_2)$ is their maximum common subgraph,
$\max(\cdots)$ is the standard numerical maximum operation, and $| \cdots |$ denotes the
size of the graph. Usually, this is taken to be the sum of the number of nodes
and edges in the graph. However, for the frequency-based graph representations
described in Sect. 2, the graph size is defined as the sum of the node frequency
values added to the sum of the edge frequency values. Other similar graph distance
measures have been proposed as well [16] [17].

For general graphs the computation of the maximum common subgraph is 
NP-Complete. However, for the graph representations of web documents presented
in Sect. 2, the computation of the maximum common subgraph is $O(n^2)$,
with n being the number of nodes, due to the existence of unique node labels 
in the graph representations (i.e. we need only examine the intersection of the 
nodes, since each node has a unique label) [18]. We will give some examples of 
actual run times when using the graph-based methods in Sect. 5. 
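
The following sketch shows how the graph distance and the k-NN rule might be combined
for graphs with unique node labels. It assumes each graph is a pair (set of node
labels, set of edges), takes the graph size as the number of nodes plus the number of
edges, and breaks ties in favour of the tied class with the closest neighbor. The
function names are hypothetical.

```python
from collections import Counter

def graph_size(graph):
    """Number of nodes plus number of edges."""
    nodes, edges = graph
    return len(nodes) + len(edges)

def mcs_distance(g1, g2):
    """1 - |mcs(g1, g2)| / max(|g1|, |g2|) for graphs with unique node labels."""
    common_nodes = g1[0] & g2[0]
    # An edge present in both graphs necessarily joins two common nodes.
    common_edges = g1[1] & g2[1]
    largest = max(graph_size(g1), graph_size(g2))
    return 1.0 - (len(common_nodes) + len(common_edges)) / largest if largest else 0.0

def knn_classify(query, training, k):
    """training is a list of (graph, label) pairs."""
    ranked = sorted(((mcs_distance(query, g), label) for g, label in training),
                    key=lambda pair: pair[0])
    nearest = ranked[:k]
    votes = Counter(label for _, label in nearest)
    best = max(votes.values())
    tied = {label for label, count in votes.items() if count == best}
    # nearest is sorted by distance, so the first tied label has the closest neighbor.
    return next(label for _, label in nearest if label in tied)
```

Because node labels are unique, the maximum common subgraph reduces to set
intersections, which is what keeps the distance computation inexpensive.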

4 k-Means Clustering with Graphs

The k-means clustering algorithm is a simple and straightforward method for
clustering data [14]. The basic algorithm is given in Fig. 4. As with the k-NN
classification algorithm, the typical procedure is to represent each item to be
clustered as a vector in a Euclidean space. The distance measure used by
the algorithm is usually one of Eqs. (1-3).



Inputs: the set of n data items and a parameter, k, defining the number of clusters to create
Outputs: the centroids of the clusters and, for each data item, the cluster (an integer in [1, k]) it belongs to
Step 1. Assign each data item randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the centroids of each cluster.
Step 3. Given the new centroids, assign each data item to be in the cluster of its closest centroid.
Step 4. Re-compute the centroids as in Step 2. Repeat Steps 3 and 4 until the centroids do not change.

Fig. 4. The basic k-means clustering algorithm.



As we saw in Sect. 3, we have available to us a method of computing distances
between graphs [Eq. (4)]. However, note that the k-means algorithm, like
many clustering algorithms, requires not only the computation of distances, but
also of cluster representatives. In the case of k-means, these representatives are
called centroids. Thus we need a graph-theoretic version of the centroid, which 
itself must be a graph, if we are to extend this algorithm to work with graph 
representations of web documents. Our solution is to compute the representa- 
tives (centroids) of the clusters using median graphs [19]. The median of a set 
of graphs is defined as the graph from the set which has the minimum average 
distance to all the other graphs in the set. Here the distance is computed with 
the graph-theoretic distance measure mentioned above [Eq. (4)]. 
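
A sketch of the resulting procedure is given here. The distance is passed in as a
function so that any graph distance can be used; the names median_graph and
graph_kmeans, the fixed random seed, the iteration cap, and the handling of empty
clusters are assumptions made for illustration.

```python
import random

def median_graph(graphs, dist):
    """The member of graphs with the smallest total (hence average) distance to the set."""
    return min(graphs, key=lambda g: sum(dist(g, other) for other in graphs))

def graph_kmeans(graphs, k, dist, seed=0, max_iter=100):
    """k-means in which centroids are median graphs; returns (centroids, assignment)."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in graphs]       # Step 1: random assignment
    centroids = []
    for _ in range(max_iter):
        new_centroids = []
        for c in range(k):                                # Steps 2 and 4: medians
            members = [g for g, a in zip(graphs, assignment) if a == c]
            # An empty cluster is re-seeded with a random item (a choice made here).
            new_centroids.append(median_graph(members, dist) if members
                                 else rng.choice(graphs))
        if new_centroids == centroids:
            break                                         # centroids did not change
        centroids = new_centroids
        assignment = [min(range(k), key=lambda c: dist(g, centroids[c]))  # Step 3
                      for g in graphs]
    return centroids, assignment
```

Since each median is itself one of the input graphs, no new graph ever has to be
constructed, which avoids the need for an averaging operation on graphs.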

We wish to clarify a point that may cause some confusion. Clustering with 
graphs is well established in the literature. However, with those methods the 
entire clustering problem is treated as a graph, where nodes represent the items 
to be clustered and weights on edges connecting nodes indicate the distance 
between the objects the nodes represent. The goal is to partition this graph,
breaking it up into several connected components which represent clusters. The
usual procedure is to create a minimal spanning tree of the graph and then
remove the remaining edges with the largest weight until the number of desired
clusters is achieved [20]. This is very different from the technique we described
in this section, since it is the data (in this case, the web documents) themselves
which are represented by graphs, not the overall clustering problem.



5 Experiments and Results 

In order to evaluate the performance of the graph-based k-NN and k-means
algorithms as compared with the traditional vector methods, we performed experiments
on two different collections of web documents, called the F-series and
the J-series [21]. These two data sets were selected for two major reasons.
First, all of the original HTML documents are available, which is necessary
if we are to represent the documents as graphs; many other document collections 
only provide a pre-processed vector representation, which is unsuitable for use 
with our method. Second, ground truth assignments are provided for each data 
set, and there are multiple classes representing easily understandable groupings 
that relate to the content of the documents. Some web document collections are 
not labeled or are presented with some other task in mind than content-related 
classification (e.g. building a predictive model based on user preferences). 

The F-series originally contained 98 documents belonging to one or more of 
17 sub-categories of four major category areas: manufacturing, labor, business 
& finance, and electronic communication & networking. Because there are 
multiple sub-category classifications from the same category area for many of these 
documents, we have reduced the categories to just the four major categories 
mentioned above in order to simplify the problem. There were five documents 
that had conflicting classifications (i.e., they were classified to belong to two 
or more of the four major categories) which we removed, leaving 93 documents, 
in order to create a single class classification problem, which allows for a more 
straightforward way of assessing classification accuracy. The J-series contains 
185 documents and ten classes: affirmative action, business capital, information 
systems, electronic commerce, intellectual property, employee rights, materials 
processing, personnel management, manufacturing systems, and industrial 
partnership. We have not modified this data set. Additional results on a third, 
larger data set can be found in [9] [10] [12]. 

For the vector-model representation experiments there were already several 
pre-created term-document matrices available for our experiments at the same 
location where we obtained the two document collections. We selected the 
matrices with the smallest number of dimensions. For the F-series documents there 
are 332 dimensions (terms) used, while the J-series has 474 dimensions. We 
performed some preliminary experiments and observed that other term-weighting 
schemes (i.e., tf·idf, see [6]) improved the accuracy of the vector-model 
representation for these data sets either only very slightly or in many cases not at all. 
Thus we have left the data in its original format. 

^1 The data sets are available under these names at: 
ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata/ 

For our graph-based experiments we used a maximum graph size of 30 nodes 
per graph; this corresponds to setting m = 30 (see Sect. 2). This parameter 
value was selected based on previous experimental results, and has been shown 
to work adequately for both data sets (see also the study by Turney [22]). Further 
results are omitted for brevity. For the distance related graph representations, 
n-distance and n-simple distance, we used n = 5 (i.e. 5-distance and 5-simple 
distance). Classification accuracy was assessed by the leave-one-out method. 
Clustering performance is measured using two performance indices which 
indicate the similarity of obtained clusters to the “ground truth” clusters. The first 
performance index is the Rand index [23], which is computed by examining the 
produced clustering and checking how closely it matches the ground truth 
clustering. It produces a value in the interval [0, 1], with 1 representing a clustering 
that perfectly matches ground truth. The second performance index we use is 
mutual information [24], which is an information-theoretic measure that 
evaluates the overall degree of agreement between the clustering under consideration 
and ground truth, with a preference for clusters that have high purity. Higher 
values of mutual information indicate better performance. The clustering 
experiments were repeated ten times to account for the random initialization of the 
k-means algorithm, and the average of these experiments is reported. 
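As an illustration of the first performance index, the following sketch computes the Rand index in its usual pair-counting form: the fraction of item pairs on which the produced clustering and the ground truth agree (both place the pair together, or both place it apart). The function name and the toy labels are assumptions made for the example; the exact formulation used in [23] may be stated differently.

```python
from itertools import combinations


def rand_index(predicted, truth):
    """Fraction of item pairs treated the same way by both labelings (value in [0, 1])."""
    if len(predicted) != len(truth):
        raise ValueError("labelings must cover the same items")
    pairs = list(combinations(range(len(truth)), 2))
    if not pairs:
        raise ValueError("at least two items are required")
    agreements = sum(
        (predicted[i] == predicted[j]) == (truth[i] == truth[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Cluster identifiers are arbitrary, so a relabeled but identical partition scores 1.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that cuts across the ground truth agrees on only 2 of the 6 pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Since only co-membership of pairs is compared, no matching between cluster numbers and class names is needed.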

The results for the k-NN classification experiments are given in Table 1. The 
first column, DS, indicates which data set (F or J-series) the experiments were 
performed on. The next column, k, is the number of nearest neighbors used 
in that experiment. The next six columns are the results of using our graph- 
theoretic approach when utilizing the various types of graph representations of 
web documents described in Sect. 2. These are, from left to right: standard, 
simple, 5-distance, 5-simple distance, raw frequency, and normalized frequency. 
The final column is the accuracy of the vector representation approach using 
a distance based on Jaccard similarity [6], which is the best performing vector 
distance measure we have worked with. On our system, a 2.6 GHz Pentium 4 
with 1 gigabyte of memory, the graph method took an average of 0.2 seconds 
to classify a document for the F-series, and 0.45 seconds for the J-series, both 
when using k = 1 and the standard representation. 
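The leave-one-out protocol behind Table 1 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name is hypothetical, the distance is passed in as a parameter (a graph distance or a vector distance alike), and ties in the vote are broken in favour of the label of the nearest tied neighbour, which is an assumption.

```python
from collections import Counter


def loo_knn_accuracy(items, labels, distance, k):
    """Classify each item by its k nearest neighbours among all the other items."""
    correct = 0
    for i, item in enumerate(items):
        # Hold item i out and rank the remaining items by distance to it.
        neighbours = sorted(
            (distance(item, other), j)
            for j, other in enumerate(items)
            if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(items)


# Toy one-dimensional data with absolute difference as the distance.
points = [0.0, 1.0, 2.0, 10.0]
classes = ["a", "a", "a", "b"]
accuracy_k1 = loo_knn_accuracy(points, classes, lambda x, y: abs(x - y), k=1)
```

In the toy data the single "b" item has no neighbour of its own class, so it is always misclassified and the accuracy is 3 out of 4.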

Similarly, the results for the k-means clustering experiments are given in 
Table 2. The columns are identical to Table 1, with the exception of the second 
column, which indicates the performance measures (PM) used: Rand index (R) 
or mutual information (MI). The average time to create clusters for the F-series 
using the graph-based method and the standard representation was 22.7 seconds, 
while it took 59.5 seconds on average for the J-series. 

From the results we see that the standard representation, in all experiments, 
exceeded the equivalent vector procedure. In 11 out of 12 experiments, the simple 
representation outperformed the vector model. The 5-distance representation 
was better in 8 out of 12 experiments. The 5-simple distance representation was 
an improvement in 9 out of 12 cases. Raw frequency was better in 8 of 12 cases, 
while normalized frequency was an improvement in 11 of 12 cases. 

Table 1. Experimental results for k-NN classification. Classification accuracy is given 
as percent of documents correctly classified using leave-one-out. The best results of 
each experiment are shown in bold. 

DS  k    Std      Sim      5-D      5-SD     RF       NF       Vec
F   1    96.77%   94.62%   92.47%   93.55%   93.55%   97.85%   93.55%
F   3    95.70%   97.85%   94.62%   96.77%   96.77%   96.77%   94.62%
F   5    96.77%   96.77%   93.55%   94.62%   93.55%   93.55%   88.17%
F   10   95.70%   94.62%   95.70%   92.47%   91.40%   92.47%   83.87%
J   1    81.08%   80.00%   82.16%   82.16%   81.08%   76.76%   73.51%
J   3    84.86%   81.62%   82.70%   83.78%   81.62%   80.00%   74.59%
J   5    85.41%   85.95%   83.78%   83.78%   80.00%   81.62%   74.05%
J   10   85.41%   84.86%   83.78%   82.16%   77.30%   84.32%   77.30%

Table 2. Experimental results for k-means clustering. Performance is measured with 
Rand index (R) and mutual information (MI). The best results of each experiment are 
shown in bold. 

DS  PM   Std      Sim      5-D      5-SD     RF       NF       Vec
F   R    0.7202   0.7064   0.6912   0.7056   0.7186   0.7003   0.6899
F   MI   0.1604   0.1323   0.1371   0.1457   0.1404   0.1319   0.1020
J   R    0.8741   0.8692   0.8663   0.8605   0.8639   0.8660   0.8717
J   MI   0.2487   0.2400   0.2305   0.2107   0.2204   0.2516   0.2316

For the classification experiments, the best accuracy for the F-series was 
97.85%, which was achieved by both the simple representation (for k = 3) and the 
normalized frequency representation (for k = 1). In contrast, the best accuracy 
using a vector representation was 94.62% (for k = 3). For the J-series, the best 
graph-based accuracy was 85.95% (for simple, k = 5); the best vector-based 
accuracy was 77.30%. 

For the clustering experiments, the best F-series results were attained by the 
standard representation (0.7202 for Rand index; 0.1604 for mutual information). 
The performance of the vector approach was 0.6899 and 0.1020 for Rand and mu- 
tual information, respectively. For the J-series, the best Rand index was obtained 
for standard (0.8741) while the best mutual information value was attained for 
normalized frequency (0.2516). In comparison, the vector-based clustering for 
the J-series achieved 0.8717 for Rand index and 0.2316 for mutual information. 

6 Conclusions 

In this paper we have provided a description of several methods of representing 
web document content as graphs, rather than the vector representations which 
are typically used. We introduced six different types of graph representations 
(standard, simple, n-distance, n-simple distance, raw frequency, and normalized 
frequency) and performed experiments with both classification using k-Nearest 
Neighbors and clustering using k-means on two web document collections. The 
results showed an improvement over the traditional vector representation in most 
cases for both classification and clustering. Overall, the standard representation 
was the best, with the simple and normalized frequency representations also 
performing well. We will examine the reasons for the variations in performance 
for different representations in future research. 

A Graph-Based Framework for Web Document Mining 411 

In future experiments we will utilize larger data sets, to investigate scalability 
issues, and examine additional graph representation types. The graph represen- 
tations of web documents presented here can capture term order, proximity, 
section location, and frequency information. We could look at more elaborate 
representations that capture additional information. Note that new represen- 
tations would not necessitate the creation of new algorithms in order to make 
use of them. Incorporating expert knowledge from the field of natural language 
understanding (e.g. the first sentence of the paragraph is the most important; 
words that appear in italics or boldface are important terms) could also improve 
the quality of the created graphs. We will also consider extending the graph- 
based techniques presented here to other data mining methods. Another open 
topic to be explored is finding a procedure for determining the optimal size of 
each graph (i.e. parameter m) automatically. 
Abstract. The Norme in rete (NIR) [Legislation on the Net] national 
project aims at making the retrieval of, and the navigation between, 
legal documents in a distributed environment easier, and at encouraging 
the development of systems with characteristics of interoperability and 
effectiveness of use. To this end, two standards have been defined: a 
URN standard, to identify these materials through uniform names, and 
XML-DTDs to describe legislative documents within the NIR domain. 
This paper illustrates the definition of these standards and the development 
of tools aimed at making their adoption easier. In particular, 
it presents a specific law drafting environment, NIREditor, able 
to produce legal documents and to handle legacy legislative documents 
according to the NIR standards. 



1 Introduction 

Access to legal information for citizens is one of the main objectives of democracy. 
Users and legal experts increasingly feel the need to retrieve legal documents 
from the Web, together with the links between them, in order to learn about the 
law and fully understand legal texts. To implement these services and to eliminate 
the historical fragmentation of information in the legislative environment, in Italy 
the “Norme in Rete” (NIR) project (“Legislation on the Net”) has been proposed 
by the CNIPA [Italian National Center for Information Technology in the Public 
Administration] in conjunction with the Italian Ministry of Justice. The project 
aims at creating a single access point on the Web with search and retrieval 
services for legal documents, as well as a mechanism of stable cross-references able 
to guide users towards relevant sites of public authorities participating in the 
project. To achieve these purposes, the NIR project proposed the adoption of 
XML as a standard for representing law documents. In particular, the project 
proposed a description of law texts by three DTDs with increasing degree of depth: 
they aim at representing a legal text with respect to its structural or formal 
profile and, using particular meta-information, with respect to its semantic or 
functional profile. Moreover, a uniform cross-referencing system based on URN 
standards [1], able to provide stable cross-references, has been established. In order 
to make the adoption of these standards easier, some tools have been developed 
within the NIR project. In particular, this paper presents the NIREditor authoring 
tool, which includes facilities and modules for managing new 
or legacy law documents according to the established standards. 

In Section 2 the standards established within the project are illustrated. In 
Section 3 the main features of NIREditor are presented: particularly in Sections 
3.1, 3.2, 3.3 different working situations are described. Finally, in Section 4, some 
conclusions are discussed. 

2 The NIR Standards 

The feasibility study of the NIR project proposed the adoption of XML as a 
standard for representing legal documents. This study aimed at representing a legal 
text with respect to its formal structure, using also additional meta-information 
and a uniform cross-referencing system providing documents with characteristics 
of interoperability and effectiveness of use. This preliminary study, carried out by 
two specific national working groups, produced two main official standards: 

1. a standard for cross-referencing legal documents has been defined in 
accordance with the uniform name (URN) technique: an unambiguous identifier 
that allows the references to be expressed in a stable way, independently of 
their physical location; 

2. a standard for legal document description has been formulated by defining 
XML-DTDs (NIR-DTDs) of increasing degree of depth in text hierarchy 
description for different kinds of legal documents (a similar initiative is the 
MetaLex project [2]). As well as including the NIR-URN standard for 
cross-references, the NIR-DTDs provide: 

- a structural description of text, establishing constraints in the hierarchy 
of the formal elements of a legislative text (collections of articles); 

- a specification of the metadata which can be applied to a legislative 
document or to parts of it. 



2.1 The URN Standard 

Within the NIR project, documents are identified through a uniform name. 
Uniform Resource Names (URNs) were conceived by the Internet community for 
providing unambiguous and lasting identifiers of network resources, independent 
of physical location. In legal documents, references to other legislative measures 
are very frequent and extremely important. The hypertext links of the Web meet 
this need, but do not appear to be suitable for wide-scale use in the law: a reference 
to a resource is, in fact, based on its physical location, expressed in 
a uniform mode through its Uniform Resource Locator (URL), which presents 
the following well-known problems: 

- difficulty in knowing the location of the cited resource; 

- the loss of validity over time of the locations (URL) in the references; 

- the impossibility of referring to resources that have not been published yet; 

which, therefore, make the network of links between documents extremely limited 
with respect to its potential and increasingly unreliable over time. 

In order to avoid these problems, a system of references has been chosen that 
is based on assigning a uniform name to each legal resource and on resolution 
methods (RDS: Resolver Discovery Service) able to retrieve the corresponding 
object. These tools conform to those defined within the IETF (Internet 
Engineering Task Force) by the special working group (URN Working Group) 
and described in various documents - from the official standards (RFC: Request 
For Comments [3], [4], [5]) to the drafts - to which alignment is guaranteed 
even in the future. A uniform name is assigned to every legal document as an 
unambiguous identifier, in a standardized format that depends only on the 
characteristics of the document itself and is, therefore, independent of on-line 
availability, of physical location and of access mode. This identifier is used as a 
tool for representing the references - and more generally every type of relation 
- between the legal acts. In an on-line environment with resources distributed 
among different Web publishers, its use facilitates the construction of a global 
hypertext between legal documents and of a knowledge base storing the relations 
interconnecting them. The association of the uniform name with the document 
occurs through meta-information, which may be: 

- inserted in the document itself: this is the solution that can be adopted in 
HTML files (through the META tag) and also in XML files (through a 
suitable tag); 

- external but strictly related to the document: by traditional techniques, such 
as a specific attribute in a database, or by emerging methods, such as RDF 
technology. 

In any case, the software tools used must be able to implement and update the 
(distributed or centralized) catalogues which are functional for resolution and, 
therefore, to give access to the document through the uniform name. Other meta- 
information (for example, details, title, subject-matter, relations, whether in 
force, etc.) which enrich the system response, can be present in these catalogues 
that store the uniform name and location for each document. The uniform names 
system of the domain of interest must include: 

- a schema for assigning names capable of representing unambiguously any 
legal measure, issued by any authority at any time (past, present and future); 

- a resolution mechanism - in a distributed way - from uniform name to on-line 
location of the corresponding resources. 

Uniform names in the law, as proposed by a special NIR working group, have 
been adopted as a technical regulation by the Italian legislative system. In conformity 
with RFC 2141 URN Syntax [3], which defines the general syntax of a uniform 
name, a name-space identified by “nir” has been defined for legal documents (this 
space identifies the context in which the names are valid and significant) and, 
therefore, the corresponding URNs have the following format: 

<URN> ::= "urn:nir:" <NSS-nir> 

The specific name <NSS-nir> must contain information appropriate for 
unambiguously identifying the document. In the legal domain this consists essentially 
of four data items: the enacting authority (or the authority referred to), the type of 
measure, the details and any annex. For legislation, it is also necessary to 
distinguish between any later versions of the document, following amendments that 
have been made over a period of time. In this case, the identifiers of the 
legislative act remain the same, but information is added regarding the version under 
consideration. Therefore, the more general structure of the specific name appears 
as follows: 



<NSS-nir> ::= <document> ["@" <version>] 

A structure for identifying the document is defined, composed of the four 
fundamental elements mentioned above, clearly distinguished one from another 
in accordance with an order identifying increasingly narrow domains and com- 
petence: 

<document> ::= <authority> ":" <measure> ":" <details> [":" <annex>] 

The main elements of the uniform name are generally divided into several 
elementary components, each having established rules of representation (criteria, 
modes, syntax and order). Such a syntax allows the automatic construction of 
the URN, starting from the text of the citation. The complete syntax specifica- 
tion of the uniform names belonging to the “nir” name-space can be seen in [1], 
whilst some important examples of uniform names of legal documents are: 

Act 24 November 1999, No. 468 

urn:nir:stato:legge:1999-11-24;468 

Decree of Ministry of Finance of 20.12.99 

urn:nir:ministero.finanze:decreto:1999-12-20;nir-3 

Decision of the Italian Constitutional Court No. 7 of 23 January 1995 

urn:nir:corte.costituzionale:sentenza:1995-01-23;7 
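The automatic construction of a uniform name from the components of a citation can be illustrated with a short sketch. It covers only the shape visible in the three examples above (authority, measure, and details made of an ISO date and a number, joined by ":" and ";"); the function name and parameters are hypothetical, and the normalization rules, version information and full grammar of [1] are not reproduced.

```python
from datetime import date
from typing import Optional


def build_nir_urn(authority: str, measure: str, issued: date,
                  number: str, annex: Optional[str] = None) -> str:
    """Assemble a "nir" uniform name from already-normalized components."""
    # Details: ISO date and act number separated by ";" (as in the examples).
    details = f"{issued.isoformat()};{number}"
    urn = f"urn:nir:{authority}:{measure}:{details}"
    if annex:
        # Optional annex component, appended after a further ":".
        urn += f":{annex}"
    return urn


act = build_nir_urn("stato", "legge", date(1999, 11, 24), "468")
decree = build_nir_urn("ministero.finanze", "decreto", date(1999, 12, 20), "nir-3")
decision = build_nir_urn("corte.costituzionale", "sentenza", date(1995, 1, 23), "7")
```

The sketch assumes the caller has already mapped the citation text to normalized tokens such as "stato" or "legge"; that mapping is where the established rules of representation mentioned above would apply.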

To each uniform name, the system of resolution has the task of associating 
the respective network locations. It is based, within a distributed architecture, 
on two basic components: a chain of information in DNS (Domain Name System) 
and a series of resolution services from URNs to URLs, each competent within 
a specific domain of the name space. Particular attention has been paid to the 
resolution system in order to provide an answer to the user even in the case of 
incomplete or incorrect uniform names, derived from incorrect citations 
(for example, the resolution service returns the list of the documents whose 
URNs partially match the provided URN, or it attempts to correct the URN 
itself automatically). 
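The fallback behaviour for incomplete names can be sketched as a lookup in a catalogue that maps uniform names to locations. Everything here is an assumption made for illustration: the catalogue is a plain dictionary, the locations are placeholder example.org addresses, and "partial match" is interpreted as a simple prefix match, which the actual resolution services may implement differently.

```python
def resolve(urn, catalogue):
    """Return (uniform name, location) pairs for an exact hit, else for prefix matches."""
    if urn in catalogue:
        return [(urn, catalogue[urn])]
    # Incomplete name: list every catalogued document whose URN starts with it.
    return sorted(
        (name, location)
        for name, location in catalogue.items()
        if name.startswith(urn)
    )


# Hypothetical catalogue; the locations are placeholders, not real addresses.
CATALOGUE = {
    "urn:nir:stato:legge:1999-11-24;468": "http://example.org/a",
    "urn:nir:stato:legge:1999-12-17;488": "http://example.org/b",
    "urn:nir:corte.costituzionale:sentenza:1995-01-23;7": "http://example.org/c",
}

exact = resolve("urn:nir:stato:legge:1999-11-24;468", CATALOGUE)
candidates = resolve("urn:nir:stato:legge:1999", CATALOGUE)
```

A citation that omits the act number therefore still yields a short list of candidate documents instead of an error.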
2.2 The NIR-DTDs Standard 

As well as the NIR-URN standard, the NIR project has defined a standard 
based on XML, aimed at describing the content of legislative documents. For 
this purpose three DTDs with increasing degree of depth have been established: 

- the “DTD flessibile” (nirloose.dtd) contains about 180 elements: it does not 
establish any mandatory rules (except for a very small number) and it is used 
for legacy legislative documents not following drafting rules; 

- the “DTD base” (nirlight.dtd) contains about 100 elements: it represents a 
subset of the “DTD complete” and is useful for training users in adopting the 
DTD standards; 

- the “DTD complete” (nirstrict.dtd) contains about 180 elements: it follows 
legislative drafting rules and it is used to write new legal documents. 

The “DTD flessibile” and “DTD complete” are composed of four common files: 

1. global.dtd: containing general definitions; 

2. norme.dtd: containing definitions of the division structures; 

3. text.dtd: for text, table and form structure definitions; 

4. meta.dtd: containing metadata schemes definitions. 

Differences are present in the main files nirstrict.dtd and nirloose.dtd. The 
nirstrict.dtd establishes an order to the partitions of a law text. Collections of 
articles are still considered the basic elements of the norm (their numbering is 
independent from the hierarchical organization of the other elements). Numbering 
of the divisions is mandatory. Titles of the divisions are not provided, while 
they are optional for the other elements. The nirloose.dtd establishes only a few 
constraints and it is used for legacy legislative documents which usually do not 
follow particular legislative drafting rules. The NIR-DTDs basically describe a 
legislative text under two profiles: 

- the formal profile, which considers a legislative text as made up of divisions; 

- the functional profile, which considers a legislative text as composed of 
elementary components called provisions (fragments of a regulation) [6]. 

In other words, the fragments of text inserted have a formal and a functional 
appearance. They are, at the same time, partitions and provisions, according to 
whether they are seen from a formal or functional view-point. The two points of 
view can be alternated as required during the definition of the text. 

In particular, the functional profile can also be considered as composed of two 
sub-profiles: the regulative profile and the thematic profile. The first one reflects 
the lawmaker's directions, the second one the peculiarities of the regulated field. 
From the point of view of the NIR-DTDs, the regulative profile is identified by 
particular metadata called analytical provisions, while the thematic profile is 
partly illustrated in the so-called subjects of the provisions. 

Fig. 1. The NIREditor and its connections to general-purpose XML editors. 

3 The NIREditor 

The NIR-DTDs identify a wide and complex subset of documents: basically 
law texts and regulative acts. The production of new documents, as well as the 
transformation of legacy contents according to the NIR standards, can be a hard 
problem to face without an editing system guiding and supporting the user. 

Even though programs for drafting texts in XML already exist, we have 
decided to develop a specific environment to handle NIR-XML documents. The 
limits of present XML editors, when used for a specific class of documents, 
in fact concern the generality and inadequacy of their editing functions, in 
particular as regards functions implementing the NIR-DTDs constraints. 

Therefore, just as specialized editors exist for producing HTML documents 
according to the HTML-DTD, a specialized visual editor (NIREditor) has been 
developed [7] [2] to help the drafting of law texts according to the NIR-DTDs 
standard: it consists of a law drafting environment supporting specific 
functions of Italian legislative technique. The software architecture of NIREditor is 
represented by a kernel consisting of a library of specific Java functions, fully 
integrated within the law drafting environment; these functions can also be 
integrated into the main general-purpose XML editors supporting a Java API (Fig. 1). 

The NIREditor operates within the URN and DTD NIR framework and it 
is designed to assist the drafting of new texts, as well as to process legacy law 
texts. Two working situations are thus catered for: the processing of an existing 
text and the production of new texts, the latter covering both the composition 
and the organization of new texts. Figure 2 shows the NIREditor drafting 
environment. 

3.1 Importing Texts 

In this case, instruments for recognizing the basic aspects of the texts are 
available, which allow automatic pre-marking of all the parts of the structure 
recognized in the text analyzed, in accordance with the NIR-DTD, thus recognizing 
the formal profile of the legislative text. This structure parser is designed to 
help the XML conversion of documents, which otherwise would have to be 
carried out completely by hand; it also includes a cross-reference parser able to 
locate cross-references and to assign the related URNs; it is based on a grammar 
implementing a bottom-up parsing strategy. 

Fig. 2. The NIREditor environment. 

Currently the structure parser implements a non-deterministic finite-state 
automaton (NFA), where the states are represented by the elements of the 
NIR-DTD, and the transitions among the states are associated with formal rules of 
document parts division. Likewise, the cross-reference parser is constructed as a 
syntactical parser [8], [9], on the basis of a cross-reference grammar. In case of 
parsing errors, the completion, correction and validation of the pre-marking is 
possible using formal structure management functions and a text panel where 
plain text can be handled. The result of the structure parsing function is the 
formal profile of the text, which is established by the structural elements of the 
NIR-DTDs. 

A further way of marking a pre-existing text is the application of the 
analytical metadata to a law text, and therefore the recognition of the functional 
profile of a legislative text, whose schema is established by the NIR-DTDs. 
Such metadata are intended to qualify the provisions of a law text. Examples of 
provisions are duty, right, delegation, competence, power. Like the marking of the 
formal structure, the insertion of analytical metadata for provision classification 
can be carried out manually; however, this can be particularly time consuming. 
Therefore, within NIREditor a module supporting the user in provision 
classification, based on machine learning techniques for text classification, has 
been developed: it automatically extracts from the text of the provisions their 
relevant meanings according to the NIR analytical metadata schemes. 

420 Carlo Biagioli et al. 

Considering a logical document (hereinafter simply “document”) as the 
portion of a law text containing a provision, and with $\mathcal{D}$ being the set of 
the selected documents, each document $d_j \in \mathcal{D}$ has been described by a 
feature vector. Considering that the documents we deal with are usually rich in 
text, we have considered words as features. Other possible choices, such as 
considering phrases, have been discarded a priori, since experiments with this 
approach ([10], [11]) did not produce significantly better effectiveness. According 
to [12], a reason for this behaviour is that even if phrases usually have superior 
semantic qualities, their statistical qualities are usually inferior. 

Each document $d_j$ is therefore described by a vector of term weights 
$d_j = [w_{1j}, \ldots, w_{|T|j}]$, where $T$ is the set of words occurring at least 
once in at least one document; $0 \le w_{kj} \le 1$ is calculated according to the 
standard tfidf function [13], [14], which considers the weight $w_{kj}$ as a function 
of the number of times the word occurs in $d_j$. The provision classifier has then 
been constructed as an automatic document classifier. According to [14], with 
$\mathcal{D}$ being a set of logical documents, each containing one provision, and 
$C = \{c_0, c_1, \ldots, c_{|C|}\}$ a set of categories corresponding to as many types 
of provision, our provision classifier is a ranking classifier that, for a given 
document $d_j$, returns the scores for the different categories. The score for the 
class $c_i$ is defined in terms of the function $CSV_i : \mathcal{D} \to [0,1]$ that, given 
a document $d_j \in \mathcal{D}$, returns a categorization status value for the document 
$d_j$ with respect to the class $c_i$. Such a score represents the evidence for a given 
document to belong to the class $c_i$. $CSV_i(d_j)$ is obtained in terms of 
$P(c_i \mid d_j)$, namely the probability that a document represented by the vector 
$d_j$ belongs to class $c_i$. This probability is computed using Bayes' theorem, with 
the naive assumption that words in a document occur independently of each 
other given the class (naive Bayes classifier): 

\[
P(c_i \mid d_j) = \frac{P(c_i)\, P(d_j \mid c_i)}{P(d_j)}, \quad \text{where: } P(d_j \mid c_i) = \prod_{k=1}^{|T|} P(w_{kj} \mid c_i). \tag{1}
\]

In (1), $P(d_j)$ is the probability that a randomly picked document is 
represented by the vector $d_j$, $P(c_i)$ the probability that a randomly picked document 
belongs to $c_i$, and $P(d_j \mid c_i)$ the probability that a document belonging to 
class $c_i$ is represented by the vector $d_j$. 
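A minimal sketch of how (1) can be turned into a ranking classifier is given below. It rests on assumptions that the text does not state: a multinomial event model over raw word counts rather than tfidf weights, Laplace (add-one) smoothing of the word probabilities, and words unseen in training being ignored. The function names and the toy provisions are hypothetical. The denominator $P(d_j)$ is obtained by summing the numerator over all classes, so the returned scores lie in [0, 1] and sum to 1.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: the class of each document."""
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    # log P(c): relative frequency of each class in the training set.
    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}
    # log P(w | c) with add-one smoothing (an assumption of this sketch).
    log_likelihood = {}
    for c in docs_per_class:
        denominator = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denominator) for w in vocab
        }
    return log_prior, log_likelihood


def csv_scores(doc, model):
    """Return P(c | doc) for every class c, following Eq. (1)."""
    log_prior, log_likelihood = model
    log_joint = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    # Normalize in log space for numerical stability: divide by P(doc).
    peak = max(log_joint.values())
    evidence = sum(math.exp(v - peak) for v in log_joint.values())
    return {c: math.exp(v - peak) / evidence for c, v in log_joint.items()}


train_docs = [["is", "repealed"], ["repealed", "hereby"],
              ["shall", "pay", "fine"], ["fine", "shall", "apply"]]
train_labels = ["repeal", "repeal", "penalty", "penalty"]
model = train_naive_bayes(train_docs, train_labels)
scores = csv_scores(["repealed"], model)
```

Working with logarithms avoids underflow when the product in (1) runs over many words; ranking the classes by the resulting scores gives the same order as ranking by the unnormalized numerators.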

The reliability of the classifier has been tested considering a data set of 582 
provisions distributed among 11 classes (Tab. 1) representing as many types of 
provisions. The collected data set has been used both to train the naive Bayes 
classifier and to test the reliability of the approach. In order to reduce the 
complexity of the problem, a phase of feature selection (in our case, of words) has 
been performed. From the vocabulary related to the data set, we selected a number 
of words with the highest information gain, as defined in [15], representing the 
discriminative power of a word with respect to the classes. 
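The feature selection step can be illustrated with the usual entropy-based form of information gain: the reduction in class entropy obtained by knowing whether a word is present in a document. This is a sketch under that assumption; the definition in [15] may differ in detail, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(word, docs, labels):
    """Class entropy minus the expected entropy after splitting on word presence."""
    present = [lab for doc, lab in zip(docs, labels) if word in doc]
    absent = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    return (entropy(labels)
            - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))


def top_words(docs, labels, n):
    """Return the n vocabulary words with the highest information gain."""
    vocab = sorted({word for doc in docs for word in doc})
    ranked = sorted(vocab, key=lambda w: information_gain(w, docs, labels),
                    reverse=True)
    return ranked[:n]


toy_docs = [{"repealed", "the"}, {"repealed", "the"},
            {"fine", "the"}, {"fine", "the"}]
toy_labels = ["repeal", "repeal", "penalty", "penalty"]
selected = top_words(toy_docs, toy_labels, 2)
```

In the toy data "repealed" and "fine" each separate the two classes perfectly (a gain of one bit), while "the" occurs everywhere and carries no information, so it is the word that gets discarded.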
Table 1. Classes (provisions) and number of documents for each class in the 
experiment. 

Class labels   Classes of the data set   Number of documents
c0             Repeal                     70
c1             Definition                 10
c2             Delegation                 39
c3             Delegification              4
c4             Duty                       13
c5             Reservation                18
c6             Inserting                 121
c7             Prohibition                59
c8             Permission                 15
c9             Penalty                   122
c10            Substitution              111


Table 2. Test of the classifier on the training set.

Classes    c0   c1   c2   c3   c4   c5   c6   c7   c8   c9  c10
c0         70    0    0    0    0    0    0    0    0    0    0
c1          0    9    0    0    0    0    0    0    0    1    0
c2          0    0   39    0    0    0    0    0    0    0    0
c3          4    0    0    0    0    0    0    0    0    0    0
c4          0    0    0    0   10    1    0    1    0    1    0
c5          0    0    0    0    0   17    0    0    0    1    0
c6          1    0    1    0    1    0  118    0    0    0    0
c7          0    0    0    0    0    2    0   55    0    1    1
c8          0    0    0    0    0    1    0    1   12    0    1
c9          0    0    0    0    0    0    0    0    0  120    2
c10         3    0    0    0    0    0    0    0    0    2  106


To train the classifier we performed a stemming procedure¹ on words
to obtain a normalized vocabulary, so that different variants of the same word
are considered as occurrences of the same normalized form, since they contribute
in the same way to the semantics of a text. Moreover, we considered only
the n words of the vocabulary with the best information gain. The best results
of the classifier were obtained with n = 500. In this configuration, the
classifier reached an accuracy of 95.5% on the training set. The details of the
classification results on the training set are reported in Tab. 2, where $c_i$
is the class of the provision and the entry $(c_i, c_j)$ is the number of
documents of class $c_i$ classified in class $c_j$.
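The accuracy figure can be recomputed from the confusion matrix of Tab. 2: correctly classified documents lie on the diagonal. The sketch below only copies the values of the table; the function name is ours.

```python
# Confusion matrix of Tab. 2: row i is the true class c_i,
# column j is the class c_j assigned by the classifier.
TABLE_2 = [
    [70, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 39, 0, 0, 0, 0, 0, 0, 0, 0],
    [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 10, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 17, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 118, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 2, 0, 55, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 1, 12, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 120, 2],
    [3, 0, 0, 0, 0, 0, 0, 0, 0, 2, 106],
]


def accuracy(matrix):
    """Fraction of documents on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


print(f"{100 * accuracy(TABLE_2):.1f}%")  # 556 of 582 documents -> 95.5%
```

The row sums reproduce the class sizes of Tab. 1, which also shows that the four documents of class c3 (Delegification) are all assigned to c0 (Repeal).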

The generalization capability of the classifier has been tested using the
“leave-one-out” strategy: all the collected examples are used to train the
classifier module, except one, which is not included in the training set but is
used to test the classification capability of the module. This is repeated,
leaving a different example out of the training set at each step, until all the
examples have been used to test the classifier. The results of all the tests are
combined, obtaining an evaluation of the reliability of the classifier on
examples held out from the training set.

¹ http://www.snowball.tartarus.org/italian/stemmer

422 Carlo Biagioli et al.

With the “leave-one-out” strategy the classifier obtained an accuracy of 88.6%.
The details of these classification results are reported in Tab. 3.
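The leave-one-out procedure can be sketched independently of the classifier being evaluated. In the code below, `train` and `predict` are placeholders for any training and prediction functions; the majority-class classifier is a toy stand-in used only to make the sketch runnable and is not the naive Bayes classifier of this paper.

```python
from collections import Counter


def leave_one_out_accuracy(samples, labels, train, predict):
    """Train on all examples but one, test on the held-out one, repeat for all."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            hits += 1
    return hits / len(samples)


def train_majority(samples, labels):
    """Toy classifier (assumption of this sketch): remember the most frequent class."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, sample):
    """Always predict the class remembered at training time."""
    return model
```

With n examples the classifier is trained n times, which is affordable for a data set of a few hundred provisions and a classifier as cheap to train as naive Bayes.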



Table 3. Test of the classifier according to the “leave-one-out” strategy.

Classes    c0   c1   c2   c3   c4   c5   c6   c7   c8   c9  c10
c0         67    0    0    0    0    0    1    0    0    0    2
c1          0    4    0    0    1    1    0    2    0    1    1
c2          0    0   39    0    0    0    0    0    0    0    0
c3          4    0    0    0    0    0    0    0    0    0    0
c4          0    0    1    0    3    1    0    4    0    2    2
c5          0    1    0    0    0   11    0    1    2    1    2
c6          1    0    2    0    1    0  114    1    0    0    2
c7          0    0    0    0    0    3    1   53    0    1    1
c8          0    0    1    0    0    7    0    2    3    0    2
c9          0    0    0    0    0    0    0    0    0  120    2
c10         3    0    2    0    0    1    1    0    0    2  102



3.2 The Composition of New Texts 



For the composition of new texts, NIREditor is conceived as a visual editor,
supporting the user in producing valid documents according to the chosen DTD.
No XML validation function is contained within the editing environment, since
the editor allows the user to perform only valid operations. Moreover, it helps
the user in composing particular sections of a new law document using dialogue
windows, and permits the introduction of the metadata provided by the NIR-DTDs
in the corresponding part of the document. The insertion of the XML formal
partitions provided by the NIR-DTDs is supported by the editor guide, which
suggests to the user the XML elements that can be introduced according to
the context of the insertion point.

Particular facilities available within the drafting environment are the
automatic numbering of the divisions and the update of internal references in the
event of text movements or variations. The construction of external and internal
cross-references is also automated, and tools are provided for building the
related URNs.

It is possible to construct a new text by determining the structure a priori and
inserting the content of the various parts afterwards; alternatively, passages
can be inserted in no particular order, then organized and inserted into a
suitable structure at a later time. During the composition, a legislative text
can be further enriched by applying the analytical metadata and their subjects
to the divisions. This can be done by hand or using the provision classifier as
a support. If metadata have been inserted as the result of documentary
requirements, these notes can be used to help in determining a fine logical
structure of the text being processed, as well as for subsequent network
information searches.



3.3 The Organization of New Texts 

For the organization a posteriori of new texts, two alternative strategies can be 
followed: the formal strategy and the functional strategy [6], [16]. 

The formal strategy considers the text according to its formal profile: the
text is made up of divisions (collections of articles). Using the formal
strategy, the draftsman chooses the partitions of similar rank to be organized.
The editor then creates a new partition of the immediately higher rank, applying
the rules of formal text structuring to it.

The functional strategy considers the text according to its functional profile:
the elementary component of a text is a provision (fragment of a regulation).
The draftsman carries out the same operations in an indirect way: the partitions
to be organized are chosen according to their content, affinities etc., and
their placement in the text is decided according to the preferences of the
drafter and the customary order of presentation suggested by some rules of
legislative technique. The attention to the functional profile of a legislative
text, based on analytical metadata, is one of the key points of NIREditor: it is
the precondition for creating at least a domain-specific semantic portion of the
Web.

4 Conclusion 

In this paper the standards established for publishing law documents within
a distributed architecture, based on DTD-XML and URN for cross-references,
have been presented. Such standards have been established within the NIR project
promoted by the Italian Ministry of Justice and the Italian National Center for
Information Technology in the Public Administration. To ease the adoption of
such standards, some tools have been developed. In this paper, in particular, we
have presented a visual editing system, NIREditor, able to produce new law
documents, as well as to transform legacy contents according to the NIR
standards.
Abstract. Structural analysis of web pages has been proposed several times and
for a number of purposes, such as the re-flowing of standard web pages to fit
a smaller PDA screen. elISA is a rule-based system for the analysis of
regularities and structures within web pages that is used for a fairly different
task, the determination of editable text blocks within standard web pages, as
needed by the IsaWiki collaborative editing environment. The elISA analysis
engine is implemented as an XSLT meta-stylesheet that, applied to a rule set,
generates an XSLT stylesheet that, in turn, applied to the original HTML
document, generates the requested analysis.



1 Introduction 

The HTML language was born to provide a simple and easy-to-learn markup
language for scientists to write their papers and publish them on the Internet.
With time, most markup of a web page has come to refer to graphics and layout,
and the actual content of the page is now often heavily intermixed with
structure and graphics, making it rather hard to extract and analyze.

Applications that aim to provide functionalities on web content other than
visualization therefore have a hard task in extracting relevant information from
web pages. Sophisticated searching and adaptive content delivery are just two
examples where the actual HTML code provided by the web server needs to be
decomposed, analyzed and recomposed in a different manner for the application to
deliver its functionalities.

In this paper we introduce a new kind of application where the structural analysis 
of HTML documents is necessary: the in-place editing of web pages for customiza- 
tion of content. Personal variants of web pages have been at the basis of the original 
vision for distributed hypertext systems, as described by Ted Nelson for his Xanadu 
environment [14]. Wikis [7] are recent tools that re-introduce universal editing of web 
pages and reduce considerably the complexity of the task of editing web content. On 
the other hand the presentation complexity of wiki pages is rather meager, and the 
application only allows editing of pages belonging to the wiki itself. 

The IsaWiki system [8] aims at providing easy tools for accessing and editing 
pages belonging to any web site. By allowing in-place editing of any web page, and 
by storing the edited content in the wiki site, ready for delivery any time the same 
user accesses the same page again, IsaWiki aims at creating an environment for per- 
sonal customization of web content much beyond the reach of wiki systems. 

One of the issues discussed in the design of the IsaWiki system was whether the
user would be interested in editing any part of the web page, or whether we
should provide some tools to identify those parts of the HTML code that contain
the actual content, and provide editing facilities just for them. Having
discarded the idea of simply providing complete access to the whole web page
(the layout and overall structure of a professionally designed HTML page is
often rather complex, and difficult to understand even for expert web
designers), we decided to consider a web page as a layout of independent
substructures, and to provide editing access to just those substructures that
contain the actual and specific content of the page.

For this reason we implemented a client-side tool, called elISA (Extraction of
Layout Information via Structural Analysis), that is aimed at providing a
complete structural analysis of web pages, in order to determine their content
areas and activate the editing facilities just on them.

The elISA system is a rule-based application that seeks patterns in the HTML
code and labels structures and substructures of the HTML page according to these
patterns. It is actually implemented as a transformation engine that, once
applied to HTML pages, adds appropriate attributes and other markup to the
original document. The editing part of the IsaWiki system will then search for
those attributes and activate the editing facilities just on those parts that
are marked as content areas.

Of course these rules are far from perfect and universal. They are based on an
analysis of the most frequent cases of HTML structures as described in the
literature and available on the web, so they certainly do not cover all
situations. Furthermore, best practices evolve with time, so it may well be that
new types of pages (with different applicable rules) become fashionable on the
web in the future. For this reason, elISA clearly differentiates between the
rule engine and the actual rules, which are declaratively expressed using an
appropriately defined XML-based language. New fashions in web page construction,
or new and more precise rules for existing structures, will then require no
intervention in the fundamental elISA engine, but just the specification of one
or more additional declarative rules.

In this paper we describe the elISA analysis tool, and show how it fits in
the overall IsaWiki application. In section 2 we discuss a few related works in
the area of web page analysis, and describe the kind of rules they apply in
their activities. In section 3 we describe IsaWiki and its overall architecture,
providing a justification for the creation of the elISA system. Section 4
describes the overall architecture of elISA, including the basic overall
transformation mechanism (based on XSLT meta-stylesheets) and the syntax for
expressing analysis rules. Section 5 describes the actual rules we are now using
in our analysis of web pages, which are in many cases drawn from the systems
described in section 2 and re-expressed in the syntax described in section 4.



2 Structural Analysis of Web Pages 

The HTML language was born to provide a simple and easy-to-learn markup lan- 
guage for physicists to write their papers and publish them on the Internet. The lan- 
guage provided initially just a few structural and typographical constructs as well as 
the new and relevant concept of the hypertext link. 

The advent of the Mosaic browser paved the way for the World Wide Web to
become the main tool for the information highways. Soon, every individual,
company and organization had to have their own web site, and just could not
accept the idea of settling for the kind of simple pages that HTML, properly
used, allowed them. By means of HTML tricks, tag exploitation, and creative use
of graphics, page designers have managed to make a boring structural markup
language become the means for incredibly complex and sophisticated interactive
events available through web browsers.

Of course, the drawback of this evolution was that most professionally created web 
pages started to include large markup sections aimed at decorative and layout pur- 
poses, and that these are often intermingled with the actual content of the page: al- 
though human readers can in most cases easily tell apart content and presentation, 
machine interpretation of the content is seriously hindered. 

Not even XML has provided a solution for this situation: although XML has been a
huge success as a technology and is implemented by almost every player in the
software industry, its main use nowadays is behind the scenes, server-side,
while what gets delivered to the browser is still just an HTML document
generated on the fly.

Still, the number of applications that need to identify and classify subparts
of web pages is high, and their aims are as wide and numerous as the
applications themselves, but surely include content repurposing for special
devices (e.g. mobile computers), data mining, and summarization. Even the
techniques they implement vary considerably.

For instance in [11] a hierarchical representation of the screen coordinates of each 
page element is used to determine the most common areas in the page, such as header, 
side menu (either left or right), main content area, and footer. This analysis exploits 
the expected structural similarities between professionally designed pages as sug- 
gested by usability manuals and implemented by competitors. 

[20], [13] and [15] all propose a semantic analysis of the structure of the HTML 
page, aiming at the discovery of visual similarities in the contained objects in analo- 
gous pages. The fundamental observation is that the standardization in the generation 
tools of web pages has created consistencies in the style of headings, records and text 
blocks of the same category. Unfortunately, there are many ways in HTML to obtain 
the same effect in terms of font, color and style (tag names, order of tag, use of inline, 
internal or external CSS styles, etc.). So clearly similarities in the final effect may 
well correspond to differences in the underlying code, which adds a further layer of 
complexity in the process. Also [21] proposes a method for the classification of
elements in a web page based on their presentation, nature and richness in style
information.

In [5] the repurposing of web content for PDA devices is obtained by determining
the criteria for identifying the elements of the page that constitute a content
unit. The process is iterative, starting from a single block as wide as the
whole page, and then progressively determining sub-areas within the areas
already determined, while in [9] the reformatting of the web content for smaller
PDA screens is performed by filtering specific nodes of the DOM tree and leaving
only the relevant nodes. Another contribution to the reformatting of web pages
for small screens is [16].



3 IsaWiki: A Collaborative Editing Environment 

The World Wide Web is a powerful means to publish information and provide
services but it lacks several hypertext functionalities [4]. In particular,
there is still a strong difference between the roles of the author and the
reader: readers use free and easy tools but can only choose reading paths
explicitly provided for by the authors and cannot create new content, links or
personal variants of web pages; on the other hand authors need to master a large
number of different technologies, and/or to use complex and expensive tools.
Although web pages can be created with simple tools (such as by saving text
content as HTML in Microsoft Word), the result is certainly not satisfying, and
good-looking web pages still require the intervention of technically savvy,
competent professionals. Furthermore, all authors can only edit a restricted set
of resources (the ones on which they have write permission).

Is it possible to realize a “writable Web” with current web technologies? The
IsaWiki project set out to prove that these functionalities can be provided
without revolutionizing the current architecture of the World Wide Web or
introducing new standards and protocols [18].

Two solutions have recently been gaining importance in the collaborative editing
panorama: wikis and weblogs. Weblogs [3] are tools for editing (mostly based on
web forms) and publishing of personal diaries addressed to individuals and small
communities. Although weblogs allow users to edit web pages quickly and easily
while browsing, the collaboration framework they provide is incomplete: a weblog
page is a sequence of posts ordered by date and time, rather than a page
composed of personal interventions by different users, at different times, on
different fragments of the document. More advanced functionalities are provided
by wikis [7], collaborative tools for shared writing and browsing, allowing
every reader to access and edit any page of the site, through simple web forms
and a very intuitive text-based syntax for special typographical effects. Wikis,
too, show some relevant drawbacks towards reaching a fully writable web:
graphical sophistication is still too raw, a new syntax needs to be mastered
and, above all, only pages already stored on the wiki site can be edited.

Indeed, in a universal editing environment personal interventions need to
involve every web page on every web server, regardless of users’ writing
permissions. Towards this goal, many researchers and professionals have proposed
the on-the-fly addition of contents and links to existing documents: whenever a
user accesses a page and a customized version of this page exists, the
application retrieves the original document from the origin server (which
remains unmodified) and enriches it with information stored in external
databases.

For instance, CritLink [19] was the first annotation system for the World Wide 
Web based on a proxy, followed by a lot of similar free or commercial projects such 
as Commentor [6], iMarkup [10], or the now-defunct eQuill and Thirdvoice. Recently 
the W3C has proposed a shared Web annotation system (and an underlying protocol 
to make it work), Annotea [12] based on the same schema: Annotea clients, such as 
Amaya [1] or Annozilla [2], request the annotations (expressed in RDF syntax) from 
pre-established Annotea servers and show them side by side with the original unmodi- 
fied documents. 

It is important to notice that annotation systems provide only partial support
for customization: a document, in fact, is only composed of two overlapping
layers, the original document and the annotation layer that is attached over the
main one. Thus readers cannot delete contents or re-organize the document’s
structure, and furthermore the management of multiple versions and backtracking
are not supported.

These solutions meet only some of the requirements we believe are necessary in a
fully writable web environment. IsaWiki [8] is a web editing environment being
developed at the University of Bologna. IsaWiki allows every user to easily
create, modify and above all customize web pages while browsing. Any web page
can be edited even if no write permission on the resource has been granted.

IsaWiki users browse web pages normally through their browser (currently
Internet Explorer only, but an implementation for Mozilla is well under way).
When the IsaWiki service is activated, a sidebar appears during the navigation.
Whenever the requested URL changes and the actual document is downloaded from
its origin server, the IsaWiki client queries the IsaWiki server, asking for
modifications to the requested document. If a variant of the accessed page is
present, the server sends it and it is displayed in the browser instead of the
original page. Note that the original resource remains unmodified on the origin
server.

By selecting the edit command in the IsaWiki sidebar, a content editor appears
and allows the user to add, delete and modify the content of the page. Upon
saving the changes, the modifications are sent to the server and made available
to all further subscribers. Once the editing session is closed, navigation
behaves normally. Multiple versions of the same page can be kept and displayed
individually, by selecting them in a version list always displayed in the
IsaWiki sidebar.
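The behaviour just described (serve a stored variant instead of the original page, keep several versions per page) can be illustrated with a small in-memory store. This is only a sketch under our own assumptions: the class and method names are hypothetical, variants are keyed by URL alone, and nothing here reflects the actual IsaWiki server or its protocol.

```python
class VariantStore:
    """Toy stand-in for the server side: keeps the saved variants of each URL."""

    def __init__(self):
        self._versions = {}  # url -> list of saved HTML variants, oldest first

    def save(self, url, html):
        """Store a new variant and return its 1-based version number."""
        self._versions.setdefault(url, []).append(html)
        return len(self._versions[url])

    def versions(self, url):
        """Version numbers available for a URL (what a version list would show)."""
        return list(range(1, len(self._versions.get(url, [])) + 1))

    def resolve(self, url, original_html, version=None):
        """Return the requested variant, the latest one, or the original page."""
        saved = self._versions.get(url)
        if not saved:
            return original_html
        return saved[-1] if version is None else saved[version - 1]
```

The original document is passed in and never modified, mirroring the fact that the resource on the origin server stays untouched.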

Editing is performed in place, directly within the web page and the browser
being used. Recent browsers allow users to edit content through a few recently
introduced properties of the page elements, such as contentEditable in Internet
Explorer and designMode in Mozilla. This allows inline editing of the content
within the actual page that hosts the original content, with the same styles and
layouts. These modifications affect the local copy of the document, which needs
to be posted to a server (in our case, the IsaWiki server). Figure 1 shows
different editable blocks in a new IsaWiki document.




Fig. 1. Some editable regions in an IsaWiki document

An important aspect of IsaWiki is deciding which parts of the web page are to
become editable: rather than allowing editing on the whole web page, in fact,
IsaWiki tries to identify editable regions (content areas) and non-editable
regions, usually empty, decorative or navigational areas.

To perform this task, IsaWiki looks for regularities in the HTML code of the
page. In fact, template-driven automatic page generation is often used by both
HTML editors and content management systems (CMS). Content and layout are
separately stored on the CMS and, whenever a user asks for a resource, the
server automatically merges them into a complete web page before delivery. For
instance, IsaWiki actually started off as a simple web-page creation tool called
ISA [17], a system to easily create sophisticated web sites exploiting
well-known interfaces such as MS Word for content creation (requiring no
knowledge of HTML or any other markup language) and Photoshop or other graphical
tools for layout (requiring no great expertise, but simply drawing the layout
and slicing it into different zones).

IsaWiki knows all about ISA regions, and can determine their presence and role
in the final web pages. Unfortunately, this is not true for all template-driven
web pages, where the distinction between content and layout needs to be deduced
by careful examination of the HTML code of the page. For this reason we have
built the elISA engine. Started off as a simple function within the client-side
editor, elISA has evolved to become a full-featured structural analysis tool for
web pages, amenable to being applied in many different tasks. Within IsaWiki, it
is used to determine the main areas in the web page, and to determine which of
them are content areas that are worth being modified by the user through the
IsaWiki editor.



4 The elISA Architecture and Language

The elISA engine (elISA stands for Extraction of Layout Information via
Structural Analysis) is an independent module of IsaWiki in charge of
identifying the content regions of a web page through a heuristic approach. The
elISA engine reads the document loaded by the browser and outputs a list of
editable fragments that the IsaWiki application transforms into contentEditable
or designMode elements in the DOM of the web page. elISA determines these parts
according to a number of typical layouts or layout details that are used
consistently by page designers throughout the web: for example the positioning
of the links, the organization of the frames, the location of the banners and so
on. To do so, it relies on declarative rules that are expressed with a specific
rule language, and then applied to the web page via a meta-stylesheet in XSLT.
In section 5 we will present and discuss some of these rules. Here we describe
the actual working of the elISA engine itself.

The distinction between the analysis engine and the actual rules is important: 
whichever rules we would end up choosing, and however detailed and precise our 
implementation of the rules would turn out, we were sure that in a few years, months 
possibly, new web styles and new coding techniques would come out that made our 
rules obsolete or of only limited usefulness. Thus we decided to create a generic 
evaluation engine that is fed with fresh declarative rules every time, so as to make it 
simple to keep the rules up-to-date. 

Furthermore, even if elISA is primarily used for the determination of editable
text areas within the IsaWiki editor, its rule-based approach allows for more
and more detailed analysis of any kind of regularity in web pages.

The execution cycle of the elISA engine within IsaWiki is composed of three steps:

• Cleansing of the HTML code (reduction to well-formedness): since the actual
evaluation of the elISA rules is performed through an XSLT meta-stylesheet, it
is necessary to create an equivalent version of the HTML document that is at
least well-formed. The elISA parser works within IsaWiki, which is a sidebar
within the browser, so that it is possible to make use of the HTML parser of the
browser itself.

• Evaluation and application of the rules on the HTML code: the elISA engine
then loads the rules file, as written in the elISA rule language, and applies
an XSLT stylesheet to it. The end result is an XSLT stylesheet that, applied to
the source document, creates a list of elements that have a clearly identified
semantic role (such as logo, footer, navigation bar, advertisement, content
area, etc.). The syntax of elISA rules is discussed in the following.

• Application of the editable attributes to the relevant HTML fragments of
the code: once the list of content areas is determined, they are found again in
the original document and are set as editable before the document is actually
made available to the user for editing.

Obviously, only the second step of the sequence does any actual analysis of the
page, and constitutes the core of the elISA engine. The other steps are
necessary for the blending of the elISA engine within the IsaWiki application.

At the basis of the elISA engine is a rule language for expressing regularities
in the source HTML document. Fig. 2 shows an example of a rule:

<RULES identity="false">
  <RULE context="TH | TD">
    <CALL name="nodeTextExceptTable">
      <ADD select=".//text()"/>
      <EXCEPT select=".//TABLE//text()"/>
      <EXCEPT select=".//FORM//text()"/>
      <EXCEPT select=".//OPTION//text()"/>
      <EXCEPT select=".//TEXTAREA//text()"/>
      <EXCEPT select=".//LABEL//text()"/>
      <EXCEPT select=".//BUTTON//text()"/>
    </CALL>
    <CHECK>
      <WHENEVER test="count($nodeTextExceptTable)=0">
        <SET attr="bgcolor">orange</SET>
        <SET attr="elISAtype">layout</SET>
      </WHENEVER>
    </CHECK>
  </RULE>
  <RULE>
  </RULE>
</RULES>

Fig. 2. A simple rule for checking empty layout cells 

The rule is applied in the context of TD and TH elements, i.e., whenever the
elISA engine examines one of these tags in the HTML document. The rule is
composed of the definition of the variable “nodeTextExceptTable”, which
contains the node set of all text nodes within the table header or table cell,
excluding those that belong to nested tables or form elements, and of a check
that tests whether the number of nodes in the variable is equal to 0. If this is
the case, it sets two attributes on the context element: a background color and
an elISAtype attribute set to “layout”. The actual meaning of this rule is to
identify empty layout cells as table elements that have no text content, but may
have form elements or other nested tables inside.
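The effect of the rule of fig. 2 can be emulated directly, without XSLT, to make its meaning concrete. The sketch below uses Python's standard `xml.etree.ElementTree` on an already well-formed fragment; the function names are ours, and ignoring whitespace-only text is an assumption of the sketch (an XPath `text()` test would also count whitespace nodes).

```python
import xml.etree.ElementTree as ET

# Subtrees whose text is excluded by the <EXCEPT> clauses of fig. 2.
EXCLUDED = {"TABLE", "FORM", "OPTION", "TEXTAREA", "LABEL", "BUTTON"}


def own_text_nodes(elem):
    """Yield the non-blank text found below elem, skipping excluded subtrees."""
    if elem.text and elem.text.strip():
        yield elem.text
    for child in elem:
        if child.tag not in EXCLUDED:
            yield from own_text_nodes(child)
        # The tail follows the child but still belongs to elem.
        if child.tail and child.tail.strip():
            yield child.tail


def mark_layout_cells(root):
    """Set bgcolor and elISAtype on every TD/TH cell without text of its own."""
    for cell in root.iter():
        if cell.tag in ("TD", "TH") and not list(own_text_nodes(cell)):
            cell.set("bgcolor", "orange")
            cell.set("elISAtype", "layout")
    return root
```

A cell that only wraps a search form or a nested table is labelled as layout, while a nested cell that holds real text is left untouched, since each TD or TH is examined in its own context.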

All rules are inserted in <RULE> elements, which define their context, i.e., the 
scope within which the rule has to be applied. Each context is specified by an XPath, 
which could be as simple as the name of an element or as complex as a full and com- 
plex tree-traversal specification. Within the <RULE> element there are as many 
<WHENEVER> elements as checks needing to be performed, each specifying by 
means of a boolean XPath a condition to check on the context elements. 
<WHENEVER> elements can nest, to perform additional subchecks, and can also con- 
tain specifications for the modification of the source HTML document. In this case, 
two new attributes are set to the context elements, but it is also possible to modify 
existing attributes, or to output new content within the context. 

Variables are useful to store, and access with a simple name, the node sets
whose determination may require long and convoluted XPath expressions and that
are used frequently in the rule. <CALL> elements are used to define new
variables, which may appear both globally (outside of <RULE> elements) and
locally (inside of <RULE> elements). Since the formulas that select the nodes of
a variable are quite often complex and lengthy, and are composed of a positive
formula and one or more exceptions, the <CALL> element contains two kinds of
subelements, <ADD> and <EXCEPT>, to specify positive formulas and exceptions
separately. Each, of course, requires an XPath expression.
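To make the structure of the rule language concrete, the following sketch reads a rule file into plain Python dictionaries using the standard `xml.etree.ElementTree` module. Only the elements and attributes visible in fig. 2 are handled; global <CALL> elements and nested <WHENEVER> elements are left out, and the dictionary layout is our own choice, not part of elISA.

```python
import xml.etree.ElementTree as ET


def parse_rules(xml_text):
    """Parse a rule file shaped like fig. 2 into plain Python dictionaries."""
    root = ET.fromstring(xml_text)
    rules = []
    for rule in root.findall("RULE"):
        # Each <CALL> defines a variable from positive formulas and exceptions.
        variables = {
            call.get("name"): {
                "add": [a.get("select") for a in call.findall("ADD")],
                "except": [e.get("select") for e in call.findall("EXCEPT")],
            }
            for call in rule.findall("CALL")
        }
        # Each <WHENEVER> pairs a boolean XPath test with attributes to set.
        checks = [
            {"test": w.get("test"),
             "set": {s.get("attr"): s.text for s in w.findall("SET")}}
            for w in rule.findall("CHECK/WHENEVER")
        ]
        rules.append({"context": rule.get("context"),
                      "variables": variables,
                      "checks": checks})
    return {"identity": root.get("identity") == "true", "rules": rules}
```

Applied to the rule of fig. 2, the result has one rule with context "TH | TD", one variable with a single positive formula and six exceptions, and one check that sets two attributes.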

<xsl:template match="TD | TH">
  <xsl:variable name="nodeTextExceptTable"
      select=".//text() and not((.//TABLE//text()) or
              (.//FORM//text()) or (.//OPTION//text()) or
              (.//TEXTAREA//text()) or (.//LABEL//text()) or
              (.//BUTTON//text()))"/>
  <xsl:copy>
    <xsl:if test="count($nodeTextExceptTable)=0">
      <xsl:attribute name="bgcolor">orange</xsl:attribute>
      <xsl:attribute name="elISAtype">layout</xsl:attribute>
    </xsl:if>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

Fig. 3. The XSLT template corresponding to the elISA rule in fig. 2 

The elISA rule language also contains a few additional functions that can be used 
in the XPath formulas, such as nchar($var), which returns the number of characters 
in all the strings contained in $var, lc($var) and uc($var), which convert 
the text in $var to lowercase and uppercase respectively, and differentChar($var, $char), 
which is true if $var contains at least one character not contained 
in $char. The use of some of these functions will be shown in section 5. 
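
To make the behaviour of these helper functions concrete, here is a minimal Python sketch of how they could be read. It is an illustration only, not the elISA implementation: the node set $var is modelled as a plain list of strings, the Python names are ours, and whether nchar counts whitespace is an assumption (here it does).

```python
def nchar(strings):
    """Total number of characters over all strings of the node set."""
    return sum(len(s) for s in strings)


def lc(strings):
    """Lowercase copy of every string in the node set."""
    return [s.lower() for s in strings]


def uc(strings):
    """Uppercase copy of every string in the node set."""
    return [s.upper() for s in strings]


def different_char(strings, chars):
    """True if at least one character of the strings is not listed in chars."""
    allowed = set(chars)
    return any(ch not in allowed for s in strings for ch in s)


# A cell whose text nodes hold only blanks has no character different from ' '.
blank_cell = ["  ", " "]
real_cell = ["  ", "Latest news"]
print(different_char(blank_cell, " "), different_char(real_cell, " "))
```

Read this way, not(differentChar($nodeText, ' ')) is a test for text nodes that contain nothing but spaces, which is how the function is used in section 5.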

The simplicity of the elISA language and the systematic use of XPath expressions 
led us to implement the analysis as a transformation of the source document according 
to an XSLT stylesheet. In fact each <RULE> of the document can be transformed 
rather easily into an XSLT template (see fig. 3 for the corresponding template). 

Thus the elISA engine is simply a converter that creates an XSLT stylesheet out of 
the rule set, and then applies it to the source document. The conversion is performed 
through a meta-stylesheet that creates XSLT templates out of the elISA rules. 

The final output of the transformation is composed only of those elements that 
matched one of the rule contexts, thus making a list of elements with an elISA attribute 
specifying their role. But elISA can also be used to output the very same input document 
with added attributes and content. The root element of the elISA rule language is 
the <RULES> element, which has an identity attribute. When identity is 
false, only the matching elements are output, thereby creating a simple list of matching 
elements. When identity is true, on the other hand, an identity template is 
activated, and every single node (element, attribute, text, comment, etc.) of the input 
document is copied to the output document, together with the additional attributes 
specified in the actual rules. 
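
The two output modes can be sketched in Python with the standard ElementTree module. This is a simplified illustration under stated assumptions, not the XSLT that elISA generates: a single hard-coded rule tags TD/TH cells without any text, whitespace-only cells count as empty, and the wrapper element name "matches" is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def run_rule(root, identity):
    """Tag TD/TH cells that hold no text as layout cells.

    identity=True  -> return the whole document with the added attributes
    identity=False -> return only the matching elements, as a flat list
    """
    work = copy.deepcopy(root)  # leave the caller's tree untouched
    matched = []
    for cell in work.iter():
        if cell.tag in ("TD", "TH") and not "".join(cell.itertext()).strip():
            cell.set("elISAtype", "layout")
            matched.append(cell)
    if identity:
        return work
    out = ET.Element("matches")  # hypothetical wrapper for the list output
    out.extend(matched)
    return out


page = ET.fromstring(
    "<TABLE><TR><TD>News of the day</TD><TD> </TD></TR></TABLE>")
full = run_rule(page, identity=True)
only = run_rule(page, identity=False)
print(ET.tostring(full, encoding="unicode"))
print(ET.tostring(only, encoding="unicode"))
```

With identity set to true the non-matching cell survives unchanged next to the tagged one; with identity set to false only the tagged cell is returned.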

In the example shown in figs. 2 and 3, besides setting the elISAtype attribute that 
is used by the IsaWiki application, the elISA rule also sets a color for the background 
of the cell. This attribute is specified for demonstration purposes only, to verify that 
the elISA application is working correctly. To debug the rule set, in fact, we also 
specify some kind of visual effect (that would disappear in the final rule set) for elements 
we are setting an elISAtype value on. Some examples of colored pages are 
given in the web site accompanying the conference. 



5 Determining Editable Regions in Web Pages with elISA 

In this section we first summarize a few important rules that we are actually applying 
to web pages in the context of IsaWiki, in order to determine the content areas to be 
made editable by the IsaWiki editor, and then we provide a few intermediate experi- 
mental results on how well these rules are performing for the set of pages we have 
chosen for our tests. 



5.1 Rules for Textual Content Detection 

Since the whole rule set is rather long and convoluted, we show here only the 
rules we have created for table cells and headers (context="TH | TD"). In the examples 
we have left the color coding in the rules for clarity. 

The first and foremost check is the determination of empty layout cells. This is the 
<WHENEVER> element shown in fig. 2, and it will not be repeated here. Within the 
same TH | TD context, though, many additional <WHENEVER> checks are being 
performed. We put in the nodeText variable those text nodes that do not contain links, 
and in nodeLink all links that do not contain tables or form elements. See fig. 4 for 
the actual elISA code. 

<CALL name="nodeText">
  <ADD select="$nodeTextExceptTable"/>
  <EXCEPT select=".//A[@href]//text()"/>
</CALL>
<CALL name="nodeLink">
  <ADD select=".//A[@href]//text()"/>
  <EXCEPT select=".//TABLE//text()"/>
  <EXCEPT select=".//FORM//text()"/>
  <EXCEPT select=".//OPTION//text()"/>
  <EXCEPT select=".//TEXTAREA//text()"/>
  <EXCEPT select=".//LABEL//text()"/>
  <EXCEPT select=".//BUTTON//text()"/>
</CALL>

Fig. 4. Two variable definitions for table cells 

If there are both link elements and text elements, then we apply a rule described in 
[9] that requires computing the ratio between the number of links and the number of 
non-link words in the context. The number of non-link words is estimated as the number 
of non-link characters divided by the average word length, which is set to five. The 
threshold for determining whether a table cell is a navigation element is set to 0.25 
(although in our experiments, as discussed in section 5.2, we have preferred to use 
0.30 as the threshold value, for it gives better results in our opinion). This means 
that whenever the number of links divided by the number of words in a context is 
greater than 0.25, the context is supposed to be a navigational element, otherwise a 
content area. All formulas and values are taken from [9]. This corresponds to the 
elISA fragment shown in fig. 5. 

<WHENEVER test="count($nodeLink)>0 and count($nodeText)>0">
  <WHENEVER test="count($nodeLink) div (nchar($nodeText) div 5) > 0.25">
    <SET attr="bgcolor">pink</SET>
    <SET attr="elISAtype">link</SET>
  </WHENEVER>
  <WHENEVER test="count($nodeLink) div (nchar($nodeText) div 5) < 0.25">
    <SET attr="bgcolor">cyan</SET>
    <SET attr="elISAtype">text</SET>
  </WHENEVER>
</WHENEVER>

Fig. 5. The rules for classifying navigation elements and text areas 
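
As a worked example of the ratio test, the following Python sketch recomputes the quantity used in the rule above. The function names and the sample numbers are ours; the sketch assumes the cell holds at least one non-link character, which the outer check guarantees in the rule itself.

```python
AVG_WORD_LENGTH = 5  # average word length used in the text


def link_ratio(n_links, n_text_chars):
    """Number of links per estimated non-link word."""
    words = n_text_chars / AVG_WORD_LENGTH
    return n_links / words


def classify_mixed_cell(n_links, n_text_chars, threshold=0.25):
    """Classify a cell holding both links and non-link text."""
    if link_ratio(n_links, n_text_chars) > threshold:
        return "link"
    return "text"


# 4 links next to 70 characters of plain text: 70 / 5 = 14 words, ratio 4 / 14.
ratio = link_ratio(4, 70)
print(round(ratio, 3),
      classify_mixed_cell(4, 70),
      classify_mixed_cell(4, 70, threshold=0.30))
```

The sample cell sits between the two thresholds mentioned in the text, so it is a navigation element with 0.25 and a content area with 0.30. A ratio exactly equal to the threshold is treated here as content, following the "otherwise" of the prose.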

If the entire content of the context is a link element, then this is definitely a 
navigational element, and is marked appropriately. Note that the background color goes 
from pink (quite sure) in the rule in fig. 5 to red (really sure) in the fragment in fig. 6. 

<WHENEVER test="count($nodeLink)>0 and count($nodeText)=0">
  <SET attr="bgcolor">red</SET>
  <SET attr="elISAtype">link</SET>
</WHENEVER>



Fig. 6. The rule for classifying link-only elements 
Finally, contexts without link elements can be either non-empty cells, in which 
case they are surely content areas (shown in green), or empty cells, in which case they 
are surely layout cells (shown in brown). This corresponds to the rule in fig. 7. 

<WHENEVER test="count($nodeLink)=0 and count($nodeText)>0">
  <WHENEVER test="not(differentChar($nodeText, ' '))">
    <SET attr="bgcolor">brown</SET>
    <SET attr="elISAtype">layout</SET>
  </WHENEVER>
  <WHENEVER test="differentChar($nodeText, ' ')">
    <SET attr="bgcolor">green</SET>
    <SET attr="elISAtype">text</SET>
    <WHENEVER test="nchar($nodeText) < 20">
      <SET attr="bgcolor">yellow</SET>
      <SET attr="elISAtype">layout</SET>
    </WHENEVER>
  </WHENEVER>
</WHENEVER>



Fig. 7. The rules for classifying text-only cells 
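
Taken together, the checks of figs. 5 to 7 form a small decision tree. The Python sketch below restates it for a single cell; it is our reading of the rules rather than the generated stylesheet. The two node sets are modelled as lists of strings, the function name is ours, and the final branch stands in for the empty-cell check of fig. 3.

```python
def classify_cell(link_texts, plain_texts, threshold=0.25):
    """Return (elISAtype, debug colour) for one TD/TH cell."""
    n_links = len(link_texts)
    n_chars = sum(len(t) for t in plain_texts)
    if n_links > 0 and n_chars > 0:
        # Fig. 5: mixed cell, decided by links per estimated word
        ratio = n_links / (n_chars / 5)
        return ("link", "pink") if ratio > threshold else ("text", "cyan")
    if n_links > 0:
        # Fig. 6: nothing but links
        return ("link", "red")
    if n_chars > 0:
        # Fig. 7: text only
        if all(ch == " " for t in plain_texts for ch in t):
            return ("layout", "brown")
        if n_chars < 20:
            # the nested check overrides the green/text setting
            return ("layout", "yellow")
        return ("text", "green")
    # neither links nor text: the empty layout cell of fig. 3
    return ("layout", "orange")


print(classify_cell(["Home", "News"], []))
print(classify_cell([], ["A paragraph long enough to count as content."]))
```

Note that in fig. 7 the nested check on nchar is evaluated after the green setting, so a text-only cell with fewer than 20 characters ends up as a layout cell.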



5.2 Experimental Results with elISA Rules 

As mentioned, the elISA application can be used within the IsaWiki application for 
client-side web editing, as well as an independent and autonomous tool. The complete 
list of rules we are working on for the elISA application is thus mostly aimed at 
supporting the IsaWiki requirements, and to a lesser extent at general web page analysis. 

We have run and are still running a series of tests to verify the correctness and 
solidity of the rules and of the overall application we are working on. A first test was 
performed on almost 100 real web pages (in Italian and English), and has proved the 
overall trustworthiness of the elISA system. More pages are being checked weekly, as 
we plan to reach a number of 3/500 different web pages from all over the world. 

Our tests have used a set of 19 rules aimed at identifying six different sections that 
are often found in real web pages, including content zones, layout zones (parts of the 
page without content, introduced as spacers between other zones), navigation zones 
(meant to contain navigational links within the site, rather than actual content), forms, 
logos and footers. Not all pages have identifiable parts of the page containing these 
elements, and many pages have many of each. 

The experimental results are promising. As shown in fig. 8, the most important ele- 
ments for the IsaWiki application are identified correctly. 





#   page part    present   identified   ignored   success
1   content         1262         1262         0   100.00 %
2   layout          5404         5305        99    98.17 %
3   navigation      2201         2198         3    99.86 %
4   form             147          147         0   100.00 %
5   footer            25           15        10    60.00 %
6   logo             141           44        97    31.20 %

Fig. 8. Experimental results with 97 web pages 
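
The success column of fig. 8 is the share of present page parts that were identified. The short Python check below recomputes it from the counts in the figure; the row tuples are copied from the figure, while the function name and the 0.01 tolerance are ours.

```python
ROWS = [
    # (page part, present, identified, ignored, reported success in %)
    ("content", 1262, 1262, 0, 100.00),
    ("layout", 5404, 5305, 99, 98.17),
    ("navigation", 2201, 2198, 3, 99.86),
    ("form", 147, 147, 0, 100.00),
    ("footer", 25, 15, 10, 60.00),
    ("logo", 141, 44, 97, 31.20),
]


def success_rate(present, identified):
    """Percentage of present page parts that were identified."""
    return 100.0 * identified / present


for part, present, identified, ignored, reported in ROWS:
    rate = success_rate(present, identified)
    consistent = present == identified + ignored
    print(f"{part:<11}{rate:8.2f} %  reported {reported:6.2f} %  counts ok: {consistent}")
```

In every row the present count equals identified plus ignored, and the recomputed rate agrees with the reported one to within rounding.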
The low results for the identification of footers and logos depend equally on our 
lack of consideration, so far, for those parts (we only included a few very basic rules), 
and on the number of different ways in which those zones are coded in real web 
pages. 

Of course these are only preliminary results. The rules for footers and logos can be 
easily extended to improve the overall success rate, and the rules for content zones 
can be refined to exclude smaller text elements of no relevance to content. Furthermore, 
new categories of page parts can be devised and new rules can be built for 
their identification. In the coming months we plan in fact to extend the set of rules used 
by the elISA application. More details about our experiments can be found at the URL 
http://tesi.fabio.web.cs.unibo.it/elisa. 



6 Conclusions 



In this paper we have proposed a new application field for structural analysis of web 
pages, i.e., the identification of content areas for in-place editing of web pages, as 
required, for instance, by the IsaWiki system. 

For this reason we have created a rule-based analysis tool called elISA, which can 
identify content areas by applying rules expressed in a simple XML-based language. 
The whole architecture is very simple, revolving around an XSLT transformation 
whose templates are created on the fly by applying a meta-stylesheet to the elISA rules. 
One of the advantages of this approach is that the procedural part of the whole system 
is minimal (the cleansing of the web page, mostly) and that it can be ported with very 
little effort to a number of different operating environments. 

The elISA rule language is simple and yet powerful, for it allows the expression of 
a large number of cases and patterns by means of the XPath pattern language. This 
also gives the application considerable flexibility and power with very limited 
implementation effort. 

The development of the elISA application is not finished yet. Besides adding a few 
more rules and polishing the syntax, we plan in the future to test our rules on a large 
number of real web pages, fine-tuning the rules and adding the odd new ones whose need 
becomes manifest in the course of the tests. Updated and complete information 
about elISA can be found at http://tesi.fabio.web.cs.unibo.it/elisa. 
1 Introduction 

With the ubiquity of the Web, the volume of Web documents continues to grow at a 
rapid speed. Since the Web is a vast source of information, extracting useful informa- 
tion from Web documents is important. 

HTML (Hypertext Markup Language), which is a format for visual rendering of 
Web documents, defines the <TABLE> tag for representing a table. However, 
most existing HTML documents use <TABLE> tags to lay out the formatting 
of a document. As a prerequisite for information extraction from the Web, 
it is therefore necessary to determine whether <TABLE> tags are used to present genuine 
tables or not. 

Generally, a table is a facility for presenting relational information structurally and 
concisely. This paper defines a table as an array of relational data. Specifically, we 
regard a table that relates an attribute and its value as a genuine table, as reported in 
previous works. In this paper, the set of attribute cells and the set of value cells are 
defined as an attribute area and a value area, respectively. 

Most previous works concerning table identification in HTML documents are 
based on a specific domain or take a lot of training data and time. This paper presents 
an efficient method for identifying tables in HTML documents prior to extracting 
information from the Web. 

2 Table Detection 

The proposed table detection method consists of two phases, as shown in Fig. 1: 
preprocessing and attribute-value relations extraction. The relations extraction 
in turn consists of three steps: area segmentation, checking the syntactic coherency of a 
value area, and checking the semantic coherency between attribute and value areas. 

2.1 Preprocessing 

The proposed preprocessing detects tables based on general characteristics of genuine 
or non-genuine tables. For this purpose, eight rules were devised based on a careful 
examination of general characteristics of <TABLE> tags as shown in Fig. 2. 
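
The eight rules of Fig. 2 can be sketched as a single Python function. This is an illustration under several assumptions that the paper does not state: "most cells" is taken to mean strictly more than half of the cells, the rules are tried in their numbered order, and the Cell and Table classes are hypothetical stand-ins for a parsed <TABLE> tag. Tables that no rule decides return None and would be passed on to the relations extraction.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str = ""
    is_header: bool = False   # <th> rather than <td>
    only_link: bool = False   # the cell content is just a hyperlink
    only_image: bool = False  # the cell content is just an image


@dataclass
class Table:
    rows: list                # list of rows, each a list of Cell
    has_caption: bool = False
    has_nested_table: bool = False


def most(cells, pred):
    """Assumption: 'most' means strictly more than half of the cells."""
    return sum(1 for c in cells if pred(c)) > len(cells) / 2


def th_with_td_neighbour(rows):
    """Rule 7: some <th> has a <td> directly to its right or below it."""
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if not cell.is_header:
                continue
            right = row[c + 1] if c + 1 < len(row) else None
            below = None
            if r + 1 < len(rows) and c < len(rows[r + 1]):
                below = rows[r + 1][c]
            if any(n is not None and not n.is_header for n in (right, below)):
                return True
    return False


def preprocess(table):
    cells = [c for row in table.rows for c in row]
    if table.has_caption:                                    # rule 1
        return "genuine"
    if len(cells) == 1:                                      # rule 2 (1x1)
        return "non-genuine"
    if not any(c.text.strip() for c in cells):               # rule 3
        return "non-genuine"
    if most(cells, lambda c: c.only_link):                   # rule 4
        return "non-genuine"
    if most(cells, lambda c: c.only_image):                  # rule 5
        return "non-genuine"
    if most(cells, lambda c: not c.text.strip() and not c.only_image):  # rule 6
        return "non-genuine"
    if th_with_td_neighbour(table.rows):                     # rule 7
        return "genuine"
    if table.has_nested_table:                               # rule 8
        return "non-genuine"
    return None  # undecided: left to the relations extraction


people = Table(rows=[[Cell("Name", is_header=True), Cell("Age", is_header=True)],
                     [Cell("Ann"), Cell("31")]])
print(preprocess(people))
```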

2.2 Relations Extraction 

The relations extraction deals with the <TABLE> tags that are left undetected at the 
preprocessing step. As mentioned before, the relations extraction method consists of 
three parts: area segmentation, syntactic coherency checkup, and semantic coherency 
checkup. Detailed explanation of the proposed method is given below. 

[Figure: inputs (keywords, patterns) and the two phases of the process. 
Preprocessing: detects a part of genuine or non-genuine tables based on rules. 
Relations extraction: area segmentation; syntactic coherency check in a value area; 
semantic coherency check between attribute and value areas.] 

Fig. 1. The proposed table detection process. 

Rule (1) - IF: A <CAPTION> tag exists. 
           THEN: The table is a genuine table. 
Rule (2) - IF: The size of a table is 1x1. 
           THEN: The table is a non-genuine table. 
Rule (3) - IF: The table does not have any characters. 
           THEN: The table is a non-genuine table. 
Rule (4) - IF: Most cells include only hyperlink. 
           THEN: The table is a non-genuine table. 
Rule (5) - IF: Most cells consist of images. 
           THEN: The table is a non-genuine table. 
Rule (6) - IF: Most cells are empty. 
           THEN: The table is a non-genuine table. 
Rule (7) - IF: The table includes <th> tags AND Existence of <td> tags at the right or down sides. 
           THEN: The table is a genuine table. 
Rule (8) - IF: A <TABLE> tag includes nested <TABLE> tags. 
           THEN: The table is a non-genuine table. 

Fig. 2. The proposed preprocessing rules. 

Area Segmentation. First of all, the proposed method segments a table area into 
attribute and value areas. If the size of a table is 1x2 (and 1xn) or 2x1 (and nx1), the 
first column or row corresponds to an attribute area. If the size of a table is 2x2 and 
it does not have any span, the first row (or column) would indicate an attribute area, 
thereby making the second row (or column) a value area. For tables of size 1x2, 2x1, or 
2x2, the proposed method cannot check for syntactic coherency in a value area because 
there exists only one value for each attribute. Instead, the method checks for semantic 
coherency between attribute and value areas. Specifically, for 2x2 tables with no span, 
the existence of semantic coherency should be checked by rows or columns. On the 
other hand, for a table of size 1xn (or nx1), the method checks for syntactic 
coherency by rows (or columns). Tables of size 2xn or nx2 (n > 2) may not 
have any value area. For example, if the first row or column in a table consists of one 
cell, the proposed method regards the table as a non-genuine table because no value 
area exists. The proposed method considers the first row with a single cell and the next 
row as the main attribute area and the sub-attribute area, respectively. In the case of a 2xn 
table whose first row spans more than one column but not all columns, treating 
the first row as an attribute area would make the next row a sub-attribute area. Thus, the 
proposed method regards the first column and the remaining columns as an attribute 
area and value areas, respectively. For the rest of the 2xn (or nx2) tables, it is possible to 
apply the syntactic coherency checkup by rows (or columns) and the semantic coherency 
checkup in both directions. For example, if the proposed method regards the first 
column (or row) as an attribute area, the method can check for syntactic coherency 
and semantic coherency horizontally (or vertically). If the proposed method regards 
the first row (or column) as an attribute area, then the method just examines semantic 
coherency vertically (or horizontally). Furthermore, we segment tables whose size is 
greater than or equal to 3x3 into attribute and value areas in the same way as 2xn (and nx2) 
tables. A detailed description of the coherency checkup of segmented areas follows. 

Syntactic Coherency Checkup. If a table has more than one value for each attribute 
row-wise (or column-wise), the proposed method examines whether the values exhibit 
row-wise (or column-wise) syntactic coherency. For this purpose, we define the "row (or 
column)-wise coherency" as the average coherency of the rows (or columns) 
that compose a value area: 

$$\text{Row (or column)-wise coherency} = \frac{\sum \text{row (or column) coherency}}{\text{number of rows (or columns)}} \qquad (1)$$

where the row (or column) coherency is computed by 

$$\text{Row (or column) coherency} = W_1 \times \text{data type coherency} + W_2 \times \text{length coherency}. \qquad (2)$$

The data type coherency of Equation (2) is defined by 

$$\text{Data type coherency} = \frac{\text{number of cells that share a major data type}}{\text{total number of cells in a row (or a column)}} \qquad (3)$$

where "major data type" means the data type with the highest frequency among the 
constituent cells of the corresponding row or column. The proposed method first sets the 
type of a data cell as image type or form type according to the kind of tags used in the 
cell. For the remaining cells, the proposed method determines their data types based 
on their contents. The proposed method defines 15 data types. Specifically, textual 
patterns and keywords are used to recognize some of these data types. 

The length coherency expresses how similar the lengths of the constituent cells are. It 
is the frequency of cells whose length $\alpha$ lies within the range 
$0.5 \times \text{average length of cells} < \alpha < 1.5 \times \text{average length of cells}$, and is computed by 

$$\text{Length coherency} = \frac{\text{number of cells with a length within the range } \alpha}{\text{total number of cells in a row (or a column)}} \qquad (4)$$

If it is possible to check coherency for a table in both the horizontal and the vertical 
direction, the method selects the larger value as the coherency value of the table. If the 
coherency value of a table is larger than a predefined threshold, ThCoherency, the 
table is deemed to be a genuine table. Otherwise, the proposed method applies the 
semantic coherency checkup to the table. 
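
The coherency computation can be followed step by step in a short Python sketch. The weights and the threshold are the values reported in Section 3; everything else is an assumption made for illustration: the paper's 15 data types are replaced by a toy two-type detector ("number" or "text"), the value area is a rectangular list of cell strings, and both directions are always checked.

```python
import re

W_TYPE, W_LENGTH = 0.6, 0.4   # weights reported in Section 3
TH_COHERENCY = 0.54           # threshold reported in Section 3


def data_type(cell):
    """Toy stand-in for the paper's 15 data types (assumption)."""
    if re.fullmatch(r"[-+]?\d+([.,]\d+)?", cell.strip()):
        return "number"
    return "text"


def data_type_coherency(cells):
    """Share of cells that have the most frequent data type."""
    types = [data_type(c) for c in cells]
    major = max(set(types), key=types.count)
    return types.count(major) / len(cells)


def length_coherency(cells):
    """Share of cells whose length lies strictly within 0.5x to 1.5x the average."""
    avg = sum(len(c) for c in cells) / len(cells)
    in_range = [c for c in cells if 0.5 * avg < len(c) < 1.5 * avg]
    return len(in_range) / len(cells)


def line_coherency(cells):
    """Coherency of one row or one column of the value area."""
    return W_TYPE * data_type_coherency(cells) + W_LENGTH * length_coherency(cells)


def table_coherency(value_rows):
    """Larger of the row-wise and column-wise average coherency."""
    row_wise = sum(map(line_coherency, value_rows)) / len(value_rows)
    columns = [list(col) for col in zip(*value_rows)]
    col_wise = sum(map(line_coherency, columns)) / len(columns)
    return max(row_wise, col_wise)


value_area = [["120", "Rome"], ["95", "Milan"], ["130", "Turin"]]
score = table_coherency(value_area)
print(score, "genuine" if score > TH_COHERENCY else "needs semantic checkup")
```

In this example each row mixes a number with a city name, so the rows are only moderately coherent, while each column holds values of one type and similar length. Taking the larger of the two directions is what lets the column-oriented table pass the threshold.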

Semantic Coherency Checkup. If a table is found to be inappropriate for the syntactic 
coherency checkup or if its table coherency value is smaller than ThCoherency, 
then the proposed method examines whether there exists any semantic match between 
an attribute area and its value area based on auxiliary information. The auxiliary 
information includes keywords or textual patterns, which correspond to values of 
specific attributes, as well as data types of value areas. The method regards <TABLE> 
tags with semantic relations between attribute and value areas as genuine tables. For 
instance, a table in which an attribute ‘e-mail’ has values matching the pattern 
“\w+\@\w+\.\w+” is regarded as a genuine table. 
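
The e-mail example can be written as a small Python check. The dictionary of auxiliary patterns is hypothetical and holds only the one pattern given in the text (with the "@" left unescaped, which is equivalent in Python). Requiring every value to match, rather than some of them, is also an assumption.

```python
import re

# Hypothetical auxiliary information: attribute keyword -> pattern of its values
VALUE_PATTERNS = {
    "e-mail": re.compile(r"\w+@\w+\.\w+"),
}


def semantically_coherent(attribute, values):
    """True if a known pattern for the attribute matches all of its values."""
    pattern = VALUE_PATTERNS.get(attribute.strip().lower())
    if pattern is None or not values:
        return False
    return all(pattern.search(v) for v in values)


print(semantically_coherent("E-mail", ["ann@example.org", "bob@example.com"]))
print(semantically_coherent("E-mail", ["ann@example.org", "n/a"]))
```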



3 Experimental Results 

To evaluate the performance of the proposed method, we experimented with 11,477 
<TABLE> tags from 1,393 HTML documents. For comparison, we used the same 
data as in Wang and Hu's method [1]. The <TABLE> tags consist of 1,675 genuine 
tables and 9,802 non-genuine tables. The threshold of a table coherency, ThCoher- 
ency, the weight of data type coherency, and the weight of length coherency were 
experimentally determined as 0.54, 0.6 and 0.4, respectively. Experimental results 
show that the method performs better than previous works, with 
a precision of 97.54% and a recall of 99.22%. In particular, unlike some conventional 
methods, this method does not need to be trained on a large volume of data in 
advance. However, the proposed method could not detect non-genuine tables with 
syntactic coherency or genuine tables with many empty cells. 
Abstract. Most e-mail readers spend a non-trivial amount of time regularly deleting 
junk e-mail (spam) messages, even as an expanding volume of such 
e-mail occupies server storage space and consumes network bandwidth. An on- 
going challenge, therefore, rests within the development and refinement of 
automatic classifiers that can distinguish legitimate e-mail from spam. A few 
published studies have examined spam detectors using Naive Bayesian ap- 
proaches and large feature sets of binary attributes that determine the existence 
of common keywords in spam, and many commercial applications also use Na- 
ive Bayesian techniques. Spammers recognize these attempts to thwart their 
messages and have developed tactics to circumvent these filters, but these eva- 
sive tactics are themselves patterns that human readers can often identify 
quickly. Therefore, in contrast to earlier approaches, our feature set uses de- 
scriptive characteristics of words and messages similar to those that a human 
reader would use to identify spam. This preliminary study tests this alternative 
approach using a neural network (NN) classifier on a corpus of e-mail messages 
from one user. The results of this study are compared to previous spam detectors 
that have used Naive Bayesian classifiers. Also, it appears that commercial 
spam detectors are now beginning to use descriptive features as proposed here. 



1 Introduction 

The volume of junk e-mail (spam) transmitted by the Internet has arguably reached 
epidemic proportions. While the inconvenience of spam is not new - public comments 
about unwanted e-mail messages identified the problem as early as 1975 - the volume 
of unsolicited commercial e-mail was relatively limited until the mid-1990s [3]. Spam 
volume was estimated to be merely 8% of network e-mail traffic in 2001 but has bal- 
looned to about 40% of e-mail today. One research firm has predicted that the cost of 
fighting spam across the U.S. will approach $10 billion in 2003 [14]. 

Most e-mail readers must spend a non-trivial amount of time regularly deleting 
spam messages, even as an expanding volume of junk e-mail occupies server storage 
space and consumes network bandwidth. An ongoing challenge, therefore, rests within 
the development and refinement of automatic classifiers that can distinguish legitimate 
e-mail from spam. 
Many commercial and open-source products exist to accommodate the growing 
need for spam classifiers, and a variety of techniques have been developed and applied 
toward the problem, both at the network and user levels. The simplest and most common 
approaches are to use filters that screen messages based upon the presence of 
words or phrases common to junk e-mail. Other simplistic approaches include blacklisting 
(automatic rejection of messages received from the addresses of known spammers) 
and whitelisting (automatic acceptance of messages received from known and 
trusted correspondents). In practice, effective spam filtering uses a combination of 
these three techniques. The primary flaw in the first two approaches is that they rely 
upon complacence by the spammers by assuming that they are not likely to change (or 
forge) their identities or to alter the style and vocabulary of their sales pitches. The 
third approach, whitelisting, risks the possibility that the recipient will miss legitimate 
e-mail from a known or expected correspondent with a heretofore unknown address, 
such as correspondence from a long-lost friend, or a purchase confirmation pertaining 
to a transaction with an online retailer. 

A variety of text classifiers have been investigated that categorize documents topi- 
cally or thematically, including probabilistic, decision tree, rule-based, example-based 
(“lazy learner”), linear discriminant analysis, regression, support vector machine, and 
neural network approaches [10]. A prototype system has also been designed to recog- 
nize hostile messages (“flames”) within online communications [11]. However, the 
body of published academic work specific to spam filtering and classification is lim- 
ited. This may seem surprising given the obvious need for effective, automated classi- 
fiers, but it suggests two likely reasons for the low volume of published material. First, 
the effectiveness of any given anti-spam technique can be seriously compromised by 
the public revelation of the technique since spammers are aggressive and adaptable. 
Second, recent variations of Naive Bayesian classifiers have demonstrated high de- 
grees of success. In general, these classifiers identify attributes (usually keywords or 
phrases common to spam) that are assigned probabilities by the classifier. The product 
of the probabilities of each attribute within a message is compared to a predefined 
threshold, and the messages with products exceeding the threshold are classified as 
spam. 

Sahami, et al. [9] proposed a Naive Bayesian approach that examined manually-categorized 
messages for a set of common words, phrases (“be over 21”, “only $”, 
etc.), and non-textual characteristics (such as the time of initial transmission or the 
existence of attachments) deemed common to junk e-mail. Androutsopoulos, et al. [1] 
used an edited¹, encrypted, and manually-categorized corpus of messages with a 
lemmatizer and a stop-list, using word-attributes. Both approaches used binary attributes, 
where an attribute equals 1 if a property is represented and 0 if it is not. In each case, the 
selected words were the result of hand-crafted, manually-derived selections. In addition 
to these approaches, several applied solutions exist that claim high success rates (as 
high as 99.5%) with Naive Bayesian classifiers [2], [5], [6], [7], [8] that use comprehensive 
hash tables comprised of hundreds of thousands of tokens and their corresponding 
probability values, essentially creating attribute sets of indefinite size². 

¹ The corpus utilized by [1] removed all HTML tags and attachments, and all header fields 
other than “Subject:” were removed for privacy reasons. 

While these Naive Bayesian approaches generally perform effectively, they suffer 
from two intrinsic problems. The first is that they rely upon a consistent vocabulary by 
the spammers. New words that become more frequently used must be identified as 
they appear in waves of new spam, and, in the case of hash tables, any new word must 
be assigned an initial arbitrary probability value when it is created. Spammers use this 
flaw to their advantage, peppering spam with strings of random characters to slip the 
junk messages under the classification thresholds. The second problem is one of con- 
text. Binary word-attributes, and even phrase-attributes, do not identify the common 
patterns in spam that humans can easily and readily identify, such as unusual spellings, 
images and hyperlinks, and patterns typically hidden from the recipient, such as 
HTML comments. 

In summary, Naive Bayesian classifiers are indeed naive, and require substantial 
calculations for each e-mail classification. A human reader, by contrast, requires relatively 
little calculation to deduce if a given e-mail is a legitimate message or spam. 
While spammers send messages that vary widely in composition, subject, and style, 
they typically include identifiable tactics that are designed to garner attention or to 
circumvent filters and classifiers and that are rarely used in traditional private corre- 
spondence. These evasive tactics are themselves patterns that human readers can often 
identify quickly. 

In this paper we apply a neural network (NN) approach to the classification of spam 
using attributes composed of descriptive characteristics of the evasive patterns that 
spammers employ, rather than the context or frequency of keywords in the messages. 
This approach produces similar results but with fewer attributes than the Naive Bayesian 
strategies. 



2 Methodology 

This project used a corpus of 1654 e-mails received by one of the authors over a pe- 
riod of several months. None of the e-mails contained embedded attachments. Each 
e-mail message was saved as a text file, and then parsed to identify each header ele- 
ment (such as Received : or Sub j ect : ) to distinguish them from the body of the 
message. Every substring within the subject header and the message body that was 
delimited by white space was considered to be a token, and an alphabetic word was 
defined as a token delimited by white space that contains only English alphabetic 



^ A token is a “word” separated by some predetermined delimiter (spaces, punctuation, HTML 
tags, etc.), and therefore a given token many not necessarily correspond to an actual word of 
written text. Examples from [5] include “qvp0045”, “freeyankeedom”, “unsecured”, and 
“7c266675”, among others. In [6], Graham argues that performance may be improved by pro- 
viding separate case-sensitive entries for words in a hash table (such as “FREE!!!” and 
“Free! ! !”), potentially magnifying the size of the probability table. The selection of delimiters 
and the effectiveness of scanning HTML tags for tokens are currently subjects of debate. 

Table 1. Features extracted from each e-mail.

Features From the Message Subject Header
 1  Number of alphabetic words that did not contain any vowels
 2  Number of alphabetic words that contained at least two of the following letters (upper or lower case): J, K, Q, X, Z
 3  Number of alphabetic words that were at least 15 characters long
 4  Number of tokens that contained non-English characters, special characters such as punctuation, or numeric digits at the beginning or middle of the token
 5  Number of words with all alphabetic characters in upper case
 6  Binary feature indicating occurrence of a character (including spaces) that is repeated at least three times in succession: yes = 1, no = 0

Features From the Priority and Content-Type Headers
 7  Binary feature indicating whether a priority header appeared within the message headers (X-Priority and/or X-MSMail-Priority) or whether the priority had been set to any level besides normal or medium: yes = 1, no = 0
 8  Binary feature indicating whether a Content-Type header appeared within the message headers or whether the content type of the message has been set to “text/html”: yes = 1, no = 0

Features From the Message Body
 9  Proportion (fraction) of alphabetic words with no vowels and at least seven characters
10  Proportion of alphabetic words that contained at least two of the following letters in upper or lower case: J, K, Q, X, Z
11  Proportion of alphabetic words that were at least 15 characters long
12  Binary feature indicating whether the white-space-delimited strings “From:” and “To:” were both present: 1 = yes, 0 = no
13  Number of HTML opening comment tags
14  Number of hyperlinks (“href=”)
15  Number of clickable images represented in the HTML
16  Binary feature indicating whether a color of any text within the body message was set to white: 1 = yes, 0 = no
17  Number of URLs within hyperlinks that contain any numeric digits or any of three special characters (“&”, or “@”) in the domain or subdomain(s) of the link

characters (A-Z, a-z) or apostrophes. The tokens were evaluated to create a set of 17 
hand-crafted features from each e-mail message (Table 1). 
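
As an illustration of how such descriptive features can be computed, the following is a minimal sketch of subject-header features 1, 2, 3, 5, and 6 from Table 1. It is not the authors' implementation. The function name `subject_features`, the whitespace tokenization, the treatment of only a, e, i, o, u as vowels, and the reading of feature 2 as "at least two occurrences" are all assumptions made for the example.

```python
import re

VOWELS = set("aeiou")          # assumption: "y" is not counted as a vowel
RARE_LETTERS = set("jkqxz")    # letters named in features 2 and 10


def subject_features(subject):
    """Compute a few Table 1 subject features (hypothetical helper)."""
    tokens = subject.split()  # assumption: tokens are whitespace-delimited
    # "Alphabetic words" are taken to be tokens made only of A-Z / a-z.
    words = [t for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]
    return {
        # Feature 1: alphabetic words without any vowel
        "no_vowel_words": sum(
            1 for w in words if not (set(w.lower()) & VOWELS)
        ),
        # Feature 2: words with at least two occurrences of J, K, Q, X, Z
        "rare_letter_words": sum(
            1 for w in words
            if sum(ch in RARE_LETTERS for ch in w.lower()) >= 2
        ),
        # Feature 3: words of at least 15 characters
        "long_words": sum(1 for w in words if len(w) >= 15),
        # Feature 5: words written entirely in upper case
        "uppercase_words": sum(1 for w in words if w.isupper()),
        # Feature 6: any character repeated three or more times in a row
        "repeated_char": 1 if re.search(r"(.)\1\1", subject) else 0,
    }
```

Because the features describe how tokens look rather than which tokens occur, a nonsense string such as “qvp0045” needs no vocabulary entry to contribute to the feature values.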

The e-mails were manually categorized into 800 legitimate e-mails and 854 junk 
e-mails. Half of each category was randomly selected to comprise a training set 
(n = 400 + 427 = 827) and the remaining e-mails were used as a testing set. All feature 
values were scaled (normalized) to range from 0 to 1 to serve as input units in the 
neural network. In particular, the count features (“Number of...”) were normalized by 
the largest counts of those features in the database. 
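
The normalization of the count features can be sketched as follows. This is a minimal example, assuming the features are held as rows of numbers; the function name `normalize_counts` and the handling of all-zero columns are assumptions, not details given in the text.

```python
def normalize_counts(rows):
    """Scale each count column to [0, 1] by its largest value in the data."""
    maxima = [max(column) for column in zip(*rows)]
    return [
        # assumption: a column whose largest count is 0 stays at 0.0
        [value / top if top > 0 else 0.0 for value, top in zip(row, maxima)]
        for row in rows
    ]
```

Binary features and proportions already lie between 0 and 1, so only the “Number of...” columns would need this step.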

The training data were used to train a three-layer, backpropagation neural network. 
Each of the 17 input units (features) had values in the range 0 to 1. The number of 
hidden units ranged from 4 to 14. The single output unit was trained to 1 for spam and 
to 0 (zero) for legitimate e-mail messages, and the number of training epochs ranged 
from 100 to 500. After training, the e-mail messages of the testing set were classified
to obtain generalization accuracy results. 
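
For readers unfamiliar with the architecture, a three-layer backpropagation network of this kind can be sketched in a few lines. The sketch below is illustrative only: the class name `ThreeLayerNet`, the sigmoid activations, the squared-error updates applied one sample at a time, the learning rate, and the weight initialization are assumptions, since the text specifies only the layer sizes, the targets (1 for spam, 0 for legitimate), and the range of training epochs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ThreeLayerNet:
    """Input layer, one hidden layer, one output unit (hypothetical sketch)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight vector ends with a bias weight.
        self.w_hidden = [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_hidden)
        ]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [
            sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
            for ws in self.w_hidden
        ]
        out = sigmoid(
            sum(w * h for w, h in zip(self.w_out[:-1], hidden)) + self.w_out[-1]
        )
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, samples, epochs, rate=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, out = self.forward(x)
                delta_out = (target - out) * out * (1.0 - out)
                # Hidden deltas use the output weights before their update.
                delta_hidden = [
                    delta_out * self.w_out[j] * h * (1.0 - h)
                    for j, h in enumerate(hidden)
                ]
                for j, h in enumerate(hidden):
                    self.w_out[j] += rate * delta_out * h
                self.w_out[-1] += rate * delta_out
                for j, ws in enumerate(self.w_hidden):
                    for i, v in enumerate(x):
                        ws[i] += rate * delta_hidden[j] * v
                    ws[-1] += rate * delta_hidden[j]
```

In the setting described above, the network would be built as `ThreeLayerNet(17, n_hidden)` with `n_hidden` between 4 and 14 and trained for 100 to 500 epochs on the 827 normalized training vectors.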

3 Results 

The relative success of spam filtering techniques is determined by classic measures of 
precision and recall on the testing subsets of legitimate e-mail and junk e-mail. Spam 
precision (SP) is defined as the percentage of messages classified as spam that actually 
are spam. Likewise, legitimate precision (LP) is the percentage of messages classified 
as legitimate that are indeed legitimate. Spam recall (SR) is defined as the proportion 
of the number of correctly-classified spam messages to the number of messages origi- 
nally categorized as spam. Similarly, legitimate recall (LR) is the proportion of cor- 
rectly-classified legitimate messages to the number of messages originally categorized 
as legitimate. Thus, we define the counts and the four precision and recall formulas:

$n_{SS}$ = the number of spam messages correctly classified as spam

$n_{SL}$ = the number of spam messages incorrectly classified as legitimate

$n_{LL}$ = the number of legitimate messages correctly classified as legitimate

$n_{LS}$ = the number of legitimate messages incorrectly classified as spam

Spam precision (SP) = $\frac{n_{SS}}{n_{SS} + n_{LS}}$    (1)

Legitimate precision (LP) = $\frac{n_{LL}}{n_{LL} + n_{SL}}$    (2)

Spam recall (SR) = $\frac{n_{SS}}{n_{SS} + n_{SL}}$    (3)

Legitimate recall (LR) = $\frac{n_{LL}}{n_{LL} + n_{LS}}$    (4)



Table 2 gives the results on the testing set by hidden node count and training epochs.
The trial with 12 hidden nodes and 500 epochs (highlighted in the table) produced
the lowest number of misclassifications, with 35 of the 427 spam messages
(8.20%) classified as legitimate and 32 of the 400 legitimate messages (8.00%)
classified as spam ($n_{LS}$), for a total of 67 misclassifications. Perhaps the most important
success measure, however, is Legitimate Recall (LR), and the best LR value was obtained
with 8 hidden units and 500 training epochs, the second best with 12 hidden units and 300
epochs.
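
The four measures can be checked against the counts quoted for the trial with 12 hidden nodes and 500 epochs. The sketch below assumes a hypothetical helper `precision_recall`; the counts 392 and 368 are derived from the text (427 - 35 and 400 - 32).

```python
def precision_recall(n_ss, n_sl, n_ll, n_ls):
    """Return SP, LP, SR, LR in percent.

    n_ss: spam classified as spam        n_sl: spam classified as legitimate
    n_ll: legitimate classified as such  n_ls: legitimate classified as spam
    """
    return {
        "SP": 100.0 * n_ss / (n_ss + n_ls),
        "LP": 100.0 * n_ll / (n_ll + n_sl),
        "SR": 100.0 * n_ss / (n_ss + n_sl),
        "LR": 100.0 * n_ll / (n_ll + n_ls),
    }


# 35 of 427 spam and 32 of 400 legitimate messages were misclassified.
measures = precision_recall(n_ss=427 - 35, n_sl=35, n_ll=400 - 32, n_ls=32)
```

Rounded to two decimals, these values agree with the row for 12 hidden nodes and 500 epochs in Table 2 (92.45, 91.80, 91.32, 92.00 for SP, SR, LP, LR).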

For the highlighted case, of the 35 misclassified spam messages, 30 were short in 
length - only a few lines, including HTML tags - some as brief as “save up to 27% on 
gas” followed by a hyperlink. Among the remaining five messages: one had many 
“comments” without comment delimiters, thus creating nonsense HTML tags that 
some browsers ignore (but some do not - a risk this spammer was willing to take); two 
were written almost entirely in ASCII escape codes; one followed four image files
with English words in jumbled, meaningless sentences; and one creatively used an off- 
white color for fonts to disguise the random characters appended to the end of the 
e-mail. 



Table 2. Classification results on the testing set (n = 827).

Hidden  Training  Spam           Spam        Legitimate     Legitimate
Nodes   Epochs    Precision (%)  Recall (%)  Precision (%)  Recall (%)
8       300       91.81          86.65       86.56          91.75
8       400       90.95          89.46       88.94          90.50
8       500       93.73          87.59       87.62          93.75
10      300       92.11          90.16       89.73          91.75
10      400       91.09          86.18       86.05          91.00
10      500       92.48          86.42       86.45          92.50
12      300       93.52          87.82       87.79          93.50
12      400       91.73          88.29       87.98          91.50
12      500 *     92.45          91.80       91.32          92.00
14      300       91.58          84.07       84.37          91.75
14      400       92.04          86.65       86.59          92.00
14      500       91.28          88.29       87.92          91.00

* Trial highlighted in the original table (lowest number of misclassifications).

The 32 legitimate messages were misclassified due mostly to characteristics that 
are unusual for personal e-mail. Twenty-two affected the features normally triggered 
by spam: six were from a known correspondent that prefers to write in white typeface 
on a colored background, ten were responses or forwards that quoted HTML that trig- 
gered several features, five were commercial e-mail from known vendors (with many 
hyperlinks and linkable images), and one was ranked as “low” priority from a known 
correspondent. The remaining ten messages were less obvious: four included special 
characters or vowel-less words in the subject header, three had several words with 
multiple occurrences of rare English characters (feature 2), and three had an unusual 
number of hyperlinks (due, in part, to links in signature lines). 

The NN accuracy of this study is similar to that of the Naive Bayesian classifiers 
described in [1] and [9], and Table 3 presents a comparison. 

For comparison purposes we also ran a small experiment with spam blacklist data- 
bases. While some databases are part of commercial programs, most require manual 
entry of IP addresses one at a time, apparently designed primarily for mail server ad- 
ministrators who are trying to determine whether their legitimate e-mails are being 
incorrectly tagged as spam. To test how accurately legitimate and spam e-mails are 

Table 3. Comparison results for NN and Naive Bayesian classifiers.

Classifier                     Num Feat  Num Msgs  Spam (%)  SP (%)  SR (%)  LP (%)  LR (%)
NN (12 nodes, 500 epochs)      17        827       51.6      92.5    91.8    91.3    92.0
Naive Bayesian from [9]¹
  Words                        500       1789      88.2      97.1    94.3    87.7    93.4
  Words+Phrases                500       1789      88.2      97.6    94.3    87.8    94.7
  Words+Phrases+Non-textual    500       1789      88.2      100.0   98.3    96.2    100.0
Naive Bayesian from [1]²
  Bare                         50        1099      43.8      95.1    84.0    N/A     N/A
  Stop-List                    50        1099      43.8      96.8    84.2    N/A     N/A
  Lemmatized                   100       1099      43.8      98.3    78.1    N/A     N/A
  Lemmatized + Stop List       100       1099      43.8      98.0    79.6    N/A     N/A

¹ Sahami et al. [9] used three feature sets in their approach. The first used keywords, the second considered additional key
phrases, and the last included non-textual attributes. In each case the most prevalent 500 attributes within the corpus were
selected.

² Androutsopoulos et al. [1] used a corpus from a moderated mailing list in a “bare” form and with three forms of alterations:
a lemmatized version (which changed parts of speech, such as changing “earned” to “earn”), a version edited with a
stop-list (which removed frequently used words), and a version using both the lemmatizer and a stop-list. The authors did
not provide statistics for legitimate precision (LP) or legitimate recall (LR).



tagged by the blacklist databases, we manually entered the IP addresses of the e-mail 
messages that were incorrectly tagged by the NN classifier (32 legitimate and 35 spam 
e-mails) into a site that sends IP addresses to 173 working spam blacklists and returns 
the number of hits [4]. We entered both the first (original) IP address of each message 
and also, when present, a second IP address (a possible mail server or ISP). While it is 
likely that the second IP column includes bulk e-mail servers of spammers, it is also 
likely that it includes non-spamming ISPs or Web portals that route junk e-mail messages
but presumably do not participate intentionally in spamming. We considered
single-list hits to be anomalies, since they are not confirmed by any other
blacklist on the site; we therefore counted a message as spam that would have been
blacklisted only when its hit count was greater than one. The blacklisting results are presented in Table 4. While the
percentages of legitimate e-mails considered spam by the blacklists are lower than the 
percentages of spam correctly identified as spam, it is surprising to see that over half 
were incorrectly screened using our “at least two blacklists” criterion. Even though we 
tested the blacklist databases with potentially difficult e-mails, the ones incorrectly 
classified by the NN classifier, the poor blacklisting results indicate that the blacklist- 
ing strategy, at least for these databases, is inadequate. 
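
The “at least two blacklists” criterion can be stated compactly in code. This is a sketch only: the helper names and the per-message tuples of hit counts are hypothetical, and the example data in the usage note is invented for illustration rather than taken from the experiment.

```python
def blacklisted(hit_counts, min_hits=2):
    """True if any IP address of a message is listed on at least min_hits blacklists.

    hit_counts holds one hit count per IP address entered for the message
    (first address and, when present, second address).
    """
    return any(hits >= min_hits for hits in hit_counts)


def percent_blacklisted(messages, min_hits=2):
    """Percentage of messages considered spam under the criterion."""
    flagged = sum(1 for hit_counts in messages if blacklisted(hit_counts, min_hits))
    return 100.0 * flagged / len(messages)
```

For example, `percent_blacklisted([(0, 3), (1, 0), (2,), (1, 1)])` treats the single-list hits in the second and fourth messages as anomalies and counts only the first and third messages. Passing only the first or only the second hit count of each message corresponds to the separate address columns of Table 4, and passing both corresponds to the “Either” column.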



4 Conclusion 



Although the descriptive-feature NN technique is accurate and useful, the spam preci- 
sion performance is not high enough for the technique to be used without supervision. 
For this technique to be more useful the feature set should be enhanced with additional 
members or modifications. It is important to emphasize, however, that this approach
requires many fewer features to achieve results comparable to the earlier reported 
Naive Bayesian approaches, indicating that descriptive qualities of words and mes- 
sages, similar to those used by human readers, can be used effectively to distinguish 
spam automatically. More importantly, a neural network classifier using these descrip- 
tive features may not degrade over time as rapidly as classifiers that rely upon a rela- 
tively static vocabulary from spammers. 



Table 4. Blacklisting results for the e-mails incorrectly tagged by the NN classifier.

                          Blacklisting (% Considered Spam)
Classification            First IP Address     Second IP Address     Either First or Second
                          (Original Address)   (E-mail Server/ISP)   IP Address
$n_{LS}$ (32 legitimate)  53.1                 25.0                  53.1
$n_{SL}$ (35 spam)        40.0                 60.0                  97.1

Although the techniques used in commercial systems are typically proprietary and 
not described in the literature, some of these spam filter systems now appear to be 
using descriptive features [12], [13]. As suggested in previous work [9], a combination 
of keywords and descriptive characteristics may provide more accurate classification 
and this may be what is being done in the commercial spam filters. Strategies that 
apply a combination of techniques, such as a filter in conjunction with a whitelist, 
should also yield better results. Finally, it should be emphasized that spam detection 
algorithms must continually improve to meet the evolving strategies that spammers 
employ to avoid detection. 
Abstract. Companies order, receive, and pay for goods. Hence they continually
receive and process invoices. For the most part these are printed on paper and
are dealt with manually, so that each invoice after receipt involves processing
costs of about 9 Euro on average. Often, human searching and typing of data
into computer forms is required to transfer the information from paper into the
computer, e.g. into ERP-systems, like SAP, that many companies run. This article
presents the main results of our 300-page market survey of 11 suppliers of
invoice reading systems (I-R-S), which automate the transfer of invoice data to
ERP-systems. For the scientific I-R-S community we hope to provide the service
of a better visibility of our discipline to potential investors and users.



1 Introduction 

This paper reports a selection of results worked out by about 15 people,
mostly senior experts in their fields, over a period of more than 12 months. It was a (controversial)
decision not to take up room to explain the starting conditions and to limit the
description of the (extensive) methodology to its core. (For this paper, this also implied
limiting the references, essentially to our own work.) Nevertheless, an interesting
document analysis application field is introduced. We then present those results
that we consider most promising and fruitful for further scientific treatment
by document analysis researchers and others.

Naturally, our results already incorporate a considerable amount of interpretation
on our part, even though this is not made explicit in the paper. This goes beyond
the sketch of important parameters of the invoice reading scenario provided in the
next section. Whether system demos and interviews yield interesting information
depends very much on the individual preparation and alertness of the interviewers.
However, since we add no further comments, the presented results
call for interpretation by the reader. We selected those results that we think will feed
the thoughts of interested readers and prompt them to derive interesting, hopefully
new conclusions; conclusions that we would perhaps not dare to formulate or could
not even think of.

Our consortium of four companies with experience in the I-R-S field (their specific
areas of experience are given in brackets), interim2000 (document analysis marketing and
consulting), Pylon (document process consulting), Integra (market analysis), and
DFKI (document analysis science, [1,2,3]), detected the emergence of a demand for
invoice reading systems (I-R-S) on the German market. Eleven companies supply
systems for invoice reading, see Table 1. However, two issues have remained completely
open: guidelines for a systematic elicitation of the individual necessities and constraints
of invoice reading problems, and a knowledgeable survey and comparison of the 11 available
systems. In other words:

• As a business owner or his consultant: which system fits which exact problems?

• As a system supplier: what does your system lack to fit certain (sub-) problems?

• As a researcher: what goes on outside the ivory tower? 

These issues are considered interesting for the Document Analysis and Understanding
(DAU) discipline, given that the topic at hand, I-R-S, is a subset of DAU. The
above consortium took up these issues and prepared a study¹ [6].
The building blocks were 1) a market analysis (i.e., 70 telephone interviews with
likely user companies), 2) a compilation of information requested from the companies on
their products and success stories, 3) a detailed questionnaire used in a poll of the
system suppliers, each completed with 4) an attended system demonstration and
interview. All this information is probably an interesting source for scientific considerations.



Table 1. Suppliers and systems²

Supplier                          Name of the system
BasWare GmbH                      BasWare invoice processing
Captiva                           InvoicePack
DICOM Deutschland AG              DICOM Invoice 123
Docutec AG                        Docutec Xtract for Documents
FreeFormation GmbH                4invoice
INSIDERS Technologies GmbH        smartFIX INVOICE
ITESOFT                           FreeMind für Invoices
Kleindienst Solutions             FrontCollect® Invoices
Océ Document Technologies (ODT)   DOKuStar V3.1
Paradatec GmbH                    PROSAR-AIDA
SER Solutions Deutschland GmbH    SER InvoiceMaster

² Two further suppliers, Futuresoft and TIS, appear in the study but were unable to come and
demonstrate their systems to us, due to project duties. As the demonstrations are the core basis of this
paper, they are not listed in this article.



This paper aims to present the results most informative to the document analysis 
systems community. In the remainder, first our preparatory modeling of the DAU 
application of invoice reading is reported. Then, very briefly, the most interesting 
numbers from the customer telephone interviews are enumerated. In a necessarily 
cautious business style the supplier companies and their systems are presented. Some 
important, generalized interpretations or “strategic disclosures” are reported thereafter 
and a short discussion is provided. 



¹ The study is in German, in order to sell it to German companies to refinance the effort.



2 Invoice Reading

Companies order, receive, and pay for goods, be they essentials for their business or
things as trivial as pencils, cleaners, and computers, as well as services like daily office
cleaning or hardware maintenance. Most companies run ERP-systems (“Enterprise Resource
Planning”) - in Germany, the market poll revealed that 75% of these are SAP - to store
and organize their orders, delivery notes, invoices, invoice acknowledgements, and
money transfers. Despite the availability of a standard like EDIFACT (supported by
the United Nations), on the basis of which systems can be connected so that invoices
can be sent and received electronically, it is current practice to send and receive
invoices and the like as letters, i.e., printed on paper and sent via “snail-mail”.

These invoices always contain rather similar information, a consequence of legal
requirements for the information items on invoices. However, the information items are
laid out in widely differing styles. Information items are divided
into a) head data, like invoicing party, legal invoice date, invoice number code, VAT,
invoice amount, and additional charges, and b) table data, comprising the invoice line
items, i.e., the individual positions that the invoice adds up. All the individual information
items are needed for the downstream invoice processing, up to and including archiving.

Integration of Invoice Reading Systems 

In a department that processes invoices, “digitization” has two effects. It allows for 1)
an overview of all processes, their progress and status, and 2) support of the processes.
Both lead to improved quality and higher speed, and both of these
translate into money. Thus, clearly achievable goals are:

• Strongly reduce the delivery times and the rate of unexplainably lost data, through 
the use of electronic media (especially effective, if sub-processes are spatially dis- 
tributed). 

• Overview the progress of all current processes and discover problems and delays 
(also automatically) through an automatic central bookkeeping. 

• Reduce human errors by supplying supportive information, automation of repeti- 
tive manual actions (e.g., applying notices of receipt), and supportive checking 
mechanisms (check for missing data, plausibility checks, etc.). 

However, an I-R-S allows the processes to be supported even more deeply: one can penetrate
the invoice data, reconcile them with company databases, perform all sorts of plausibility
calculations and comparisons, and immediately search, or prepare a search, for missing
data, thus significantly reducing the probability of human errors. The only precondition
is that the I-R-S can access the available information, i.e., it needs the technical
interfacing modules and the know-how to exploit them effectively [5]. This in turn gives
the invoice personnel the freedom to deal with those aspects of invoice
processing that profit most from the genuine human power of cognition and creativity.

Verifier User Interface 

The first step is the digitization of the incoming invoices, with scanners and OCR.
One part of the verification in an I-R-S is focused on assuring that these steps went right.
Thus all systems provide a verifier user interface. Verifiers should be ergonomically
prepared for the conditions of bulk processing.

Template Versus Freeform Information Spotting 

Spotting an information item, e.g., finding the legal date of a letter, can rely either on
what it looks like (freeform) or on where it is (template) [4]. Human cognition combines
both strategies smartly. However, for the design of systems, a conscious distinction
and application is crucial.

Touch-typing on a keyboard is an analogy for the template-based approach, as the keys are
hit by relying on their learned positions. On the other hand, if one searches for the
key with the label “A” to type an “A” and so forth, this is an analogy for the freeform-based
approach, as the right key is recognized by its look alone.

Whereas the freeform approach is slower due to the necessity to search all the la- 
bels/looks of the items again and again, the template approach is more brittle, as it 
relies on the match of learned positions of items and the actual search space. Not 
knowing Japanese, I cannot spot the street in an address on Japanese invoices free- 
form-based. I might be able to spot it template-based if the positions are like on Euro- 
pean invoices. However, if the positions on Japanese invoices are different, I may 
consider an item the street, which is not in fact the street. And I have no chance to 
know. 
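
The contrast between the two strategies, including the brittleness of templates, can be made concrete with a small sketch. Everything in it is an assumption for illustration: a page is modeled as a list of (text, x, y) tokens, the date pattern is one possible “look” of a date, and the function names are hypothetical rather than taken from any of the surveyed systems.

```python
import re

# Assumed "look" of a date, e.g. 25.08.2003
DATE_LOOK = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")


def spot_freeform(tokens):
    """Return the first token that looks like a date, wherever it is on the page."""
    for text, _x, _y in tokens:
        if DATE_LOOK.fullmatch(text):
            return text
    return None


def spot_template(tokens, learned_pos):
    """Return the token closest to a learned (x, y) position, whatever it looks like."""
    lx, ly = learned_pos
    return min(tokens, key=lambda t: (t[1] - lx) ** 2 + (t[2] - ly) ** 2)[0]


# A layout the template was learned on, and a layout with shifted positions.
familiar = [("Invoice", 10, 10), ("25.08.2003", 400, 50), ("4711", 400, 80)]
shifted = [("25.08.2003", 50, 300), ("Musterstrasse", 400, 50)]
```

On the familiar layout both functions return the date. On the shifted layout `spot_template(shifted, (400, 50))` returns “Musterstrasse” without any indication that it is wrong, whereas `spot_freeform(shifted)` still finds the date, at the cost of scanning every token.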

For DAU in general one will almost always need to combine both strategies. However,
in I-R-S one searches for:

• A finite set of information items 

• Information items with a distinct look 

• Always a similar set of information items (not differing very much from invoice to 
invoice) 

Consequently, at I-R-S the freeform approach is generally superior. In a number of 
practical cases a supplement with the template approach is worthwhile. And, cer- 
tainly, every softening of the above constraints requires more and more template- 
methods. 

Online ERP-Connection 

With an ERP-connection, an I-R-S can retrieve a list of all creditors (the invoicing
party, i.e., the sender of the invoice), including their addresses, tax numbers, bank
account numbers, etc. Then not only does it know what these information items generally
look like, but it can also discern which names and numbers can appear on the invoice.
The matching is extremely accurate and fast.

If the connection is online, i.e., accessible at all times, the I-R-S can further retrieve
a list of all those goods and services which were delivered to a given creditor and not
paid to date. Then the invoice line items can be matched just as accurately and
quickly.
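
Why a creditor list makes matching accurate can be sketched as follows: instead of interpreting an OCR result in isolation, the system only has to decide which known creditor it is closest to. The sketch uses Python's standard `difflib` module for approximate string matching; the function name `match_creditor`, the record layout, the similarity cutoff, and the sample creditors are hypothetical and not taken from any surveyed system.

```python
import difflib


def match_creditor(ocr_name, creditors, cutoff=0.8):
    """Return the creditor record whose name is closest to the OCR result, or None."""
    names = [c["name"] for c in creditors]
    best = difflib.get_close_matches(ocr_name, names, n=1, cutoff=cutoff)
    if not best:
        return None
    return next(c for c in creditors if c["name"] == best[0])


# Hypothetical creditor list as it might be retrieved from the ERP-system.
creditors = [
    {"name": "Muster GmbH", "tax_no": "DE123"},
    {"name": "Beispiel AG", "tax_no": "DE456"},
]
```

With this list, an OCR result such as “Mustcr GmbH” (one misread character) still resolves to the record for “Muster GmbH”, while a name that resembles no known creditor yields `None` and can be passed to the verifier.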

Classification and Page Collation 

Invoices most often span more than one page. (This was confirmed by the market
poll.) Systems should be able to intelligently collate pages into documents. Otherwise
human support is necessary, as incomplete invoices imply inconveniences. Traditionally,
a document classification feature was required, because out of a set of pre-configured
analysis recipes the system had to choose the appropriate one for the document
under consideration, i.e., its class (cf. [1,2,5]). However, in I-R-S it is sensible to
simply consider the creditor as the classification. Then, as explained above, the creditor
can be used to retrieve from an online ERP-connection the list of possible invoice
line items.

Multiple Mandators 

Many practical settings require the I-R-S to handle invoices from more than one input 
source (mandator) in parallel, and to keep them separated at all times. Basically, the 
system needs to be able to import invoices together with labels, distinguish the in- 
voices with the labels, and finally export them to different targets. Actually, however, 
systems should be able to analyze invoices from different sources with completely 
different analysis recipes. One supplier (SER GmbH) offers an even smarter feature: 
encryption of the analysis recipes. Then, human operators of one mandator cannot 
peek into the analysis recipes of the other mandators. 



3 Telephone Interviews with User Companies 

70 companies - trading, manufacturing, and combinations - were chosen out of 350
companies initially called. These 70 receive more than 500 invoices per day and have
from 1000 up to several tens of thousands of employees. This section very briefly enumerates a few
of the results.

• Process costs for one invoice (opening, sorting, internal delivery, data typing, ar- 
chiving) were estimated as 9 Euro. 

• 70% use an archive system. (Ixos 23%, Docuware 12%, ...). Most of the remaining 
30% are planning to launch an archive. 

• 97% involve their specialist departments for the invoice acknowledgement. 

• 99% cross-check their orders and delivery notes with their invoices. 

• Temporary peak loads rise up to 20% over normal. 

• Companies with more than 2000 creditors (75%) consider only 20% of them “ac- 
tive”, i.e., they send invoices frequently. 

• Invoices take 6.5 days on average from reception to clearance. 

• 50% use electronic invoicing. 70% of the other 50% consider or plan electronic 
invoicing. However, of the first 50%, 80% of invoices still come on paper. 

• 29% scan received invoices. Most, thus, type the data from paper into their sys- 
tems. 

• 30% archive their invoices early so that they are accessible during processing. 70% 
archive only after the invoice processing is finished. 

• Interview partners subjectively guessed the error rates of mistyped/misread invoice 
data to be between 0 and 20%, with an average of 5%. 

• On average, 126 invoices are processed per employee per day.

• 85% consider the future of invoice processing to be digital. 

• 60% plan to automate invoice processing. Of these, 40% plan to implement it in 
the next 12 months; 25% in the next 6 months. 

• Information on intended investments was sparse: from 50 000 € to 200 000 €.

• 77% decline to test prototype systems. 

• 73% dislike the idea of outsourcing the capture of their invoice data.
4 Systems Overview

This section enumerates the systems in alphabetical order of their suppliers' names.
The 11 presentations took place from 25 August 2003 to 1 September 2003, at Pylon,
in Frankfurt, Germany. To get an overview, we recommend first reading the last
sentence of each subsection. The rest requires reading between the
lines; however, the style of presentation at least conveys an idea of what we consider good.
Getting the full benefit of this section requires comparing the formulations
and digesting the scattered details. For professionals, some suggestions have been included.

BasWare 

The philosophy “process-orientation instead of archive-orientation” shows that BasWare
sees greater potential in supporting the processes around invoices than in
the computer analysis of invoices. At least in Scandinavia they could not be much closer to
the mark. The I-R-S module is more of a supplementary component, is only template-based
and does not use image preprocessing. The verifier is rather convenient and
allows Drag&Drop from the image into the entry mask data fields, demonstrating that
ancillary tools can be taken seriously. For projects with stronger demands on I-R-S,
BasWare is willing to cooperate with stronger analysis suppliers. The Scandinavian
parent company of the German sales department, which is listed on the stock exchange and
top rated, has its focus on e-procurement and has more than 500 users in the area of
invoice workflow support. There are interfaces for 50 ERP-systems. BasWare is Microsoft
certified (offers MS Exchange for Workflows). User-tailored systems are
configured from building blocks with parameters within 4 to 8 weeks. BasWare is
interesting for customers whose optimization potential is mainly concentrated in the
invoice acknowledgment workflows.

Captiva 

In the area of data capture Captiva has been active for quite a while; however, a soft- 
ware platform to plug document analysis components together has been in their port- 
folio for a long time. The German Captiva has existed since 1999. Having realized 
that it is a big, strategic mistake to anticipate a German market for form recognition, 
the German Captiva is now correcting this mistake. The technology, a template-based 
approach, still has some minor deficiencies. There is no module for the collation of 
pages to documents. The verifier has potential for ergonomic improvement. The sys- 
tem has no user management, i.e., cannot administer rights and duties to users. Multi- 
ple mandators are not implemented, due to no demand from respective projects yet. 
The number of invoice line items is limited to 30 per page. The strength of Captiva
is the online connection to SAP, with which the available order data can be exploited
for the analysis of invoice line items, thus achieving the best results possible.
Further, Captiva has its own OCR modules, e.g., for such special cases as recognizing
checkboxes that were ticked and later revised. Captiva has understood and
successively corrected the mistakes of the first generation of I-R-S solutions. They
have the resources to further the development of a standard product. Captiva is on the
right track.
Dicom

The technological roots of Dicom are in the USA. Since 2002 Kofax has been owned
by the corporation. The Dicom product consists of AscentCapture, DOKuStar from
OCE, and a pre-prepared set of rules, which is configured for specific projects. The
added value of their solution is claimed to lie in the use of AscentCapture (many standard
interfaces and a decentralized scan workflow). Sales in Germany focus on medium-sized
businesses. In the German market, the potential of a multi-national company
group with strengths in forms data capture is only of limited advantage. However,
Dicom has a good network of sites and features user interfaces in many languages.
The ergonomics of the verifier will probably change with some more practical experience
from projects; e.g., there will probably be a view that allows zooming in on parts of the
invoice image. The strength of Dicom is in the area of invoice scanning and their
capture platform.

Docutec 

Docutec is an established player in the German arena of document analysis and un- 
derstanding. The Docutec technology is designed for the whole paper mail in-box. 
Thus, analysis of 100 to 60 000 invoices per day is no great challenge to them. For 
tailor-made add-ons the Docutec technicians have some tools ready in their toolbox. 
Self-made OCR tools and a number of obviously successfully finished projects indi- 
cate that. Docutec has a remarkable concept of approaching the scenarios of custom- 
ers, the “DocuWay”: after analysis of the customer’s document-processes their best 
optimization potentials are determined via a quality-assured process cooperatively 
with the customer. The implementation is individually fine-tuned, achieving a complementary
mix of technical and organizational measures. The customer-to-Docutec contact
is handled continuously by one and the same Docutec person from the outset, through
the whole project, up to the aftercare. It is possible to extend or optimize the standard
I-R-S system. The specification of document classes and analysis recipes that this requires
uses a nice paradigm: the recipes are created in an object-oriented manner from
smaller analysis recipe fragments. Beyond the standard connection to SAP (creditors
and order data), Docutec offers, through partners, modules for verification and acknowledgement
workflows in or parallel to SAP. Docutec’s main strength is their “DocuWay”.

FreeFormation 

The technology of the German analysis expert FreeFormation has, in the person of Prof.
Blasius, Trier, been a fixture on the German market for many years. His approach, referred
to by many as “alternative” (actually “constraint based”), is rather efficient and suited
for the analysis of the whole paper mail inbox. However, its specific strengths come 
to bear especially at I-R-S in the respective standard system. The method does not 
need pre-defined templates (hence the “freeform” in the company name). The flexibil- 
ity of the technology together with the experience of the technicians allows them to 
master even complicated and unique I-R-S challenges. The connection to the most
frequent ERP-systems has long been an explicit part of the philosophy of FreeFormation.
FreeFormation approaches customer scenarios with the analysis and support
of the processes. With a self-made engine, independent of the ERP-system, FreeFormation
presents itself as strong in the area of verification and acknowledgement
workflows for the invoices. Hitherto, customers were large- and medium-sized companies
with 100 to 15 000 invoices per day, many with international invoices. The
verifier is very convenient, e.g., it offers Drag&Drop from the TIFF-image to the data
entry fields. “System teaching”, i.e., human advice of the location of non-system- 
spotted data, is available to the system instantly to overcome similar cases. Also, 
accumulated statistically, this information is used in other, not-so-similar cases. By 
and by, the information ages and is forgotten, so that the system does not “learn 
dumb”. FreeFormation presents a round, integral, and complete solution. 

Insiders 

The German analysis expert Insiders has been on the market with its I-R-S since 
Q 1/2003. This system can at any time be flexibly extended to the full-fledged system 
for the whole mail inbox, paper and e-mail. Special modules, like for crossed-out 
words with colored lines, have been implemented for projects. Insiders is also experi- 
enced with special standard developments for given classes of business. Even though 
Insiders had no customer for the simpler case of I-R-S, they have a lot of experience 
with medical invoices. This allows them to estimate a complete I-R-S project, from the
analysis of the customer’s processes, through a pilot, up to the launch of a productive system,
at 10 to 20 working days. From experience, this time is needed mainly for the
customer to connect the I-R-S with their IT infrastructure. The analysis approach for
invoices does not require templates to be set up. For customer-specific optimization,
however, this technique is ready and optionally available. Then documents can be 
classified according to tables, layout features, and with respect to content, and can be 
separately treated. The state-of-the-art connection to SAP imports creditor data and 
accesses order data. ERP-external workflow is realized with partners. New knowl- 
edge, which comes into the system from corrections at the verifiers, is immediately 
applied to all invoices in the system, i.e., it is applied also to those documents which 
were already completely analyzed and are in the queue to be verified. Insiders fea- 
tures the most technologically complete product. 

ITESOFT 

The German ITESOFT has its roots in the systems house FormConsult. The software
of ITESOFT, with a strong 20-year history in forms reading, comes from its French
parent company. The German market is served by ITESOFT with a standard product
for I-R-S for customers with more than 50 000 invoices per year. For detailed planning
and implementation of the product ITESOFT estimates 8 to 12 weeks. More complicated
scenarios can be served as well, perhaps partly involving the development department
in France. The system is template-based and, thus, must be configured.
Unknown invoices make the system automatically create a new template, which needs 
to be manually validated. Stacks of invoices with different numbers of pages must be 
structured with separating pages. The ERP-connection is implemented individually 
for projects. The verifier interface features a document structure view. The mask 
automatically adapts, i.e., changes, when viewing either invoice heads or invoice line 
items. ITESOFT offers a module for the graphical configuration and control of the 
internal system workflow. 

Kleindienst 

Kleindienst have their roots in services for banks and thrift institutions. Since the 
acquisition of the former German ICR they can be counted within the category of
companies ready to analyze mail inboxes. The standard product for I-R-S is immedi-
ately executable at the customer’s site without extra configuration, and is already 
aware of international specifics like French date formats. It is installed largely by 
partners. Projects with up to 35 000 active creditors have been realized. Data fields that are
manually changed at the verifiers are automatically located on the invoice image,
and their position(s) and nearby keywords are stored. After some iterations the information
is sent to the central knowledge base to improve the analysis and is also shown
as colored boxes to verifier personnel. With some interesting tools, e.g., a standard 
two-level OCR strategy (fast first, slow but more accurate second), Kleindienst is able 
to analyze difficult images like faxes and small fonts reliably. In order to save space 
at archiving, a converter from 300 to 200 dpi is offered. The SAP-connection is state- 
of-the-art (creditor and order data). An extra module is available to find and remove 
duplicates in SAP-master data, which usually greatly reduces the number of entries. 
The verifier is easily configurable, by editing a file. The experience of Kleindienst 
showed in seemingly simple remarks, e.g., that systems in larger scenarios often need 
to be installed as NT-services. Kleindienst impressed us with a comfortably controlla- 
ble and administrable tool. 

OCE Document Technologies (ODT) 

ODT has worked in the document analysis sector for a long time, inter alia in the 
analysis of the whole mail inbox. Traditional customers have been larger businesses;
however, the trend is also toward medium-sized businesses. The ODT standard product
for I-R-S works without templates. Common scenarios can be served in a very short
time, sometimes in only a couple of days. ODT has acquired substantial experience in
a number of projects, e.g., dedicated to image pre-processing and has participated in 
research projects. The feature to distinguish invoices of different mandators can be 
implemented for projects. A learning mechanism is underway. The SAP-connection 
can only access the creditor data. The analysis modules and verifier do not share the 
same set of plausibility rules. The SAP-integration is taken seriously. Partners assure 
a strong SAP-competence. It should be noted that the RecoStar OCR from ODT is 
bought and used by a number of competitors. ODT is characterized by a remarkable 
volume of experience, which shows in a plethora of different document-centric algo- 
rithms. They can solve a multitude of special problems. 

Paradatec 

Paradatec is a small document analysis specialist with a traditional focus on the mail 
inbox. Paradatec serves large businesses as well as small- and medium-sized businesses
exclusively through partners. However, they like to be involved in large or critical
projects. The standard product is pre-configured for mail inboxes and is adapted for 
specific projects. No configuration of templates is required. The system has very good 
performance. It is based on an ingenious, self-made OCR, which allows them also to 
analyze difficult images, like faxes. The external interfaces, including the connection
to ERP-systems, are provided by embedding the system into AscentCapture.
Creditor data can be imported; up-to-date order data cannot be used. The
verifier user interface offers a nice stack structure view. With the mouse one can
move a magnifying glass over the invoice image. The interface has well-developed
ergonomics and indicates some experience with bulk processing. Compared to the
other suppliers, Paradatec seems to be closest to the (positive!) cliche of the high-tech
garage company. Paradatec is always worth an inquiry call for complicated document
analysis and understanding problems. 

SER 

SER were recently awarded a prize for the best DMS. Their I-R-S technology is designed
for the whole paper mail inbox. The largest project to date processes 32 million
invoices per year. With a standardized procedure that analyzes scenarios in a process-oriented
way, their standard I-R-S can be launched very quickly - in typical scenarios,
sometimes in less than a week. The technicians of SER and their technology are prepared
for individual and special problems. The SAP connection accesses creditor and
order data. The analysis recipes for documents of different mandators are encrypted, 
so that documents and analysis recipes are secured against meddlers. For non-found
data items (typically only the invoice number, because crosscheck information is 
hardly available) the candidates on the invoice image are highlighted in the verifier. 
With a mouse-click the right candidate can be adopted. Learning knowledge from 
corrections at verifiers can be harmonized with corrections from other verifiers and 
activated at a validation user interface. The knowledge is represented with a so-called 
learning set, i.e., a set of invoices, together with marks, which indicate where the 
system searches. A self-made module for invoice validation workflows will be avail- 
able soon. SER has a comprehensive product portfolio and thus also masters related 
tasks of I-R-S in businesses, like archiving and knowledge management. 



5 Strategic Disclosures 

All suppliers have consistently referred to the fact that, at first, customers cannot 
believe that the systems actually work. 

Except for Insiders (a spin-out of DFKI) and ODT, the suppliers do not employ innovative
technology, and do not build on available scientific results or other literature;
or they pretend not to. Perhaps related to this, most of the suppliers misjudged
the capabilities, strengths, and weaknesses of their competitors. 

The suppliers substantiated that in most scenarios, with an online ERP-connection,
the recognition rates and recognition qualities were in the very high 90 percent range.

The required broad spectrum of technological competences (like OCR, invoice 
verification workflow, ERP-connection, etc.) and the necessary knowledge of the 
branches of business of users, seem to be reliably realizable with partner networks. 

The experience of the study authors was that after a careful problem analysis the 
return on investment (ROI) of I-R-S projects can be reached within one year. This 
perception was (without solicitation) supported by BasWare, OCE, and Kleindienst. 
This fact should certainly not be misinterpreted to mean that projects launching I-R-S are
inevitably successful. Naturally the common rules and methods to make projects successful
have to be obeyed here as well.

Even interested customers react tremendously negative on terminology they con- 
sider technical (e.g. “pattern matching”). 

The distinction between template and freeform strategy is not only technologically
sensible, but also very important politically: we were told about large customers who
had invited suppliers for demos and immediately cancelled the demo when the supplier
said that their system uses templates. Note that the buzzwords “forms” and “form
recognition” are regarded similarly.

I-R-S requires the classification of invoices according to creditors only. Such classification
does not require a so-called “semantic” methodology, working on the basis
of textual (i.e. content) similarities. Three suppliers, however, have this methodology
at their disposal: SER and Insiders own components; Docutec has a contract with
Amenotec.

If a module is marketed with a buzzword (like “artificial neural networks”) the par- 
ticular rationale for this technology for the task at hand should be explicated. In our 
case, the supplier could not explain the use of artificial neural networks, as the mod- 
ule was a “black box” from another supplier. Perhaps due to the stereotypical German 
rudeness we asked if they believed that the module can do its job with an artificial 
neural network particularly well. They admitted that the latest advice was no longer to 
mention artificial neural networks being used by this module. These people were 
really nice, however, they had to market their system. Many have to do that, don’t 
we? 

If in scenarios order data or delivery notes are not available, it is necessary to em- 
ploy methods for the analysis of free tables. The presentations were used only by 
Insiders, Paradatec, and - with restrictions - SER and Kleindienst, to demonstrate 
some respective features of their systems. Interestingly, in the questionnaires, almost 
all suppliers claimed to be able to analyze tables. However, they referred only to the 
ability to analyze the invoice line items, assuming the availability of the order data. 
Thus, they actually analyze the table data, but do not employ specific table methods at 
all. 

The launching of an I-R-S should always prioritize the optimization of the invoice 
handling processes. Approaches which are motivated software-technologically are 
thus seldom adequate. 



6 Discussion 

This paper introduced “invoice reading”, a document analysis application field that is
found wherever companies run ERP systems and receive printed invoices.
We reported on 11 companies and systems that are available on the German market
for this purpose. We further reported very briefly some results from 70 telephone
interviews with potential customer companies about their thoughts, requirements, and 
constraints. 

We hope that readers have started drawing their own conclusions. However, we provide
some templates here. If we say about a system “The verifier is very convenient, e.g., it
offers Drag&Drop from the TIFF-image to the data entry fields”, this shows that the
usability, and thus also the GUI, of a system, and even special features like Drag&Drop,
are crucial for us (and not only for the authors) to consider a system mature.

Because “All suppliers have consistently referred to the fact that, at first, customers
cannot believe that the systems actually work”, we conclude that even those members
of the public who are professionally searching for document analysis technology
do not know what our document analysis technology is capable of. We believe
that this is both disadvantageous for our discipline and different from other fields of
computer science and artificial intelligence, and that this is a strong motivation to get
active and change this. The fact that “Even interested customers react tremendously
negative on terminology they consider technical (e.g. “pattern matching”)” can be
interpreted to mean that it would be worth trying to better adapt or connect our language
to the public language. Where could we publish respective papers?

Because “77% [of the customer companies] decline to test prototype systems”, we
conclude that it is necessary to try to design standard-systems for standard-problems.
And further, because “The launching of an I-R-S should always prioritize the optimization
of the invoice handling processes”, one should try to derive such standard problems
from typical invoice handling processes of typical companies.



Abstract. It is well known that the integration of multiple OCR outputs can give
higher performance than a single OCR. This idea was applied to printed
Japanese document recognition and better performance was obtained. In conventional
experiments, however, the zoning, i.e. the extraction of the text regions, was
done manually, and this has been a serious problem from a practical point of
view. To solve the problem, an approach that automatically matches the classified
regions output by multiple OCRs is proposed. With the proposed method, a
high recognition rate of 98.8% was obtained from OCR systems whose performance
is no better than 97.6%.



1 Introduction 

It is well known that the integration of multiple OCR systems can give a higher recognition
rate than any single OCR system constituting the integrated system. The idea is not
new; a patent claiming it was filed in Japan in the early 1960s [10]. At that
time the idea could not be put into practice because of its cost, but the recent progress of
so-called “Software OCR” has made it more practical. Many papers [1-3] describe the
effect of the integration. Rice et al. [1] reported the effect of the integration in printed
English document recognition using six OCRs with a voting logic. Matsui et al. [3]
suggested the same idea in handprinted numeral recognition.

Tabaru et al. reported experimental results of the idea applied to printed Japanese 
document recognition [4] . 

In Rice’s or Tabaru’s systems, the zoning, or the extraction of the text regions to be
recognized, was executed manually and the extracted regions were fed to each OCR.
In this sense, the applicability of the method has been limited, because of human
intervention.

Recently Klink et al. [6] presented a fully automated system which can integrate segmentation
results using a voting logic. The documents processed in that paper are
printed in European languages and the applicability to Japanese documents is
unknown.

S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 463-471, 2004.

In this paper a further experimental result of the majority logic using JEIDA Japa- 
nese printed document database is presented. Besides, a method to integrate multiple 
OCR zoning results is explained and its performance is shown. 



2 Majority Logic Using Multiple OCR’s 

2.1 Tabaru’s Results 

One of the authors reported the effect of integration of multiple OCR devices using
majority logic [4]. In that report, a rather small dataset consisting of 1,476 Japanese characters
was used. So, intensive experiments using larger datasets may be necessary to
ensure the reliability.

2.2 Layout, Typeface, Font Sizes, and Numbering 

In this section the result of an intensive experiment is presented using six commercial 
printed Japanese software OCRs which are listed in Table 1. 



Table 1. OCRs Used in this Section.

Software Products                 Producer Names
Yonde! Koko ver.3™                A.I.Soft™
OmniPagePro 6.0j™                 Caere™ and Canon™
Yomitorimonogatari EX v2.5™       Ricoh™
Ninshikikohboh Wide 97™           RIOS SYSTEM™
e.Typist bilingual 97™            MEDIADRIVE™
WinReaderPRO v3.5™                MEDIADRIVE™



Samples of the printed Japanese documents were selected arbitrarily from the JEIDA
document database [7]. The document IDs and the numbers of characters included in them are listed
in Table 2. All of them include Kanji as well as Latin characters. The character numbers
are the sums of these two kinds.

The sampling pitch used in scanning of the JEIDA database is 600 dpi and the im- 
ages are fed to each OCR with this resolution. The “zoning” was done manually. In 
other words, the text regions to be recognized were determined by pointing devices on 
the screen by an operator. 

To apply majority logic to the multiple OCR outputs, faulty segmentation becomes
a severe problem. If there are deleted or inserted characters caused by
faulty segmentation in some OCRs, the correspondence between different OCRs will
give meaningless results. Many Japanese characters, especially kanji, have several
components (radicals) in them. So, sometimes a character is divided into two or more
simple characters. Conversely, two simple characters are sometimes merged and segmented
as a single complex character. For this reason, deletion or insertion occurs rather frequently
and the matching of character strings of different lengths is very important.

Table 2. Document Samples Used.

Document IDs            Category                       Characters
P010102 and others      Newspapers                         10,915
P020101 and others      Novels                              5,495
P030101 and others      How-to books                        3,339
P040101 and others      Elementary School Textbooks         2,326
P070101                 Manuals                               289
Total: 25 documents                                        22,364



Since Japanese sentences are printed without spaces between words, the correct
correspondence should be taken not among the words but among the printed lines. The
method used by Tabaru [4] to find the best match between lines resembles the
dynamic-programming-based string matching algorithm [8], though it is not exactly the same. In
this paper, Tabaru’s method is also adopted.

In the experiment, six OCR recognition results were compared after the adjustment 
by the matching. A forced decision scheme is adopted in each OCR and the first can- 
didates made by all OCRs were treated as the results and no rejections were consid- 
ered. 



Table 3. Experimental Results: the Numbers of Errors and Error Rate.

OCR        Substitutions   Insertions   Deletions   Recognition Rate (%)
A                    541           77           6                   97.2
B                  1,349          103          46                   93.3
C                    442           31          23                   97.8
D                    453           48          13                   97.7
E                    413           17          11                   98.0
F                    373          124          24                   97.7
Majority             112           15           0                   99.4



Simple majority logic was used to give the final decision. When three-three or two- 
two-two ties were found, the earliest result was adopted. Since the forced recognition 
mechanism was adopted, the errors were classified into substitution, insertion and 
deletion. No rejection scheme is used. 
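
The voting rule described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes that the OCR outputs have already been aligned character by character, and it interprets "the earliest result" as the candidate of the OCR that comes first in a fixed order. The function names `majority_vote` and `vote_line` are hypothetical.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate among the OCR outputs.

    Assumption: on a tie (e.g. three-three or two-two-two), the candidate
    of the OCR that appears earliest in the list wins.
    """
    counts = Counter(candidates)
    best = max(counts.values())
    for candidate in candidates:  # scan in OCR order to break ties
        if counts[candidate] == best:
            return candidate


def vote_line(aligned_outputs):
    """Vote column by column over equal-length, already aligned OCR strings."""
    return "".join(majority_vote(column) for column in zip(*aligned_outputs))
```

For example, `vote_line(["cat", "cot", "cat"])` keeps the character that two of the three outputs agree on in every position.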

The recognition errors of the six OCRs and the virtual recognition system using the majority
logic are shown in Table 3. The total number of recognized characters is 22,364, as
shown in Table 2. The recognition rate is calculated by subtracting from 100% the sum
of the three types of errors divided by the total number of characters.
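
As a quick check of this calculation, the following sketch recomputes two of the rates in Table 3 from the error counts. The function name `recognition_rate` is only illustrative; the numbers are taken from Tables 2 and 3.

```python
TOTAL_CHARS = 22_364  # total number of characters, from Table 2


def recognition_rate(substitutions, insertions, deletions, total=TOTAL_CHARS):
    """Recognition rate in percent: 100% minus (all errors / total characters)."""
    errors = substitutions + insertions + deletions
    return 100.0 * (1.0 - errors / total)


print(round(recognition_rate(541, 77, 6), 1))   # OCR A    -> 97.2
print(round(recognition_rate(112, 15, 0), 1))   # Majority -> 99.4
```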

As shown in Table 3, it is obvious that the majority logic yields a very good per- 
formance. Especially the improvement in deletion is remarkable. 



3 Automatic Matching of Text Regions 

3.1 Practical Viewpoints 

Commercial OCR software can analyze a document image and segment it into
layout structures consisting of texts, mathematical expressions, figures, tables, photographs,
captions etc. Logical structure extraction, e.g. header/footer/title/reference identification,
is one of the current research targets.

As stated in the previous section, many reports on the integration of multiple OCRs
seem to be based on manual zoning. In order to bring the integration technique into
practical use, it may be necessary to avoid or minimize human intervention.

A fully automated integration of several OCR outputs, therefore, requires matching
extracted regions which are labeled differently or in a different order. Some regions may be
deleted, merged or divided into several regions by some OCRs.

So far, Klink et al. [6] have proposed a technique called MergeLayouts to integrate faulty
segmentations made by multiple OCRs. Their method is for European languages, and
can use the fact that words are printed with spacing between them.

In contrast, Japanese sentences are printed without spaces between words, so
a new technique is necessary. Furthermore, MergeLayouts uses font and size
information output by OCRs, but it is not common for Japanese OCRs to output such
information. Developing a method that integrates OCR outputs using only character
codes is another purpose of this paper.

3.2 Difficulties in Region Matching 

In this research, only text regions are considered; figures and/or photographs are
not taken into account. If a nontext region, say a photograph, is recognized as a text
region by all OCRs, the output texts may be integrated. But such a case does not
seem to occur frequently, and misrecognized text regions exist in only a part of the
OCR set. As a result, we can suppose that all regions successfully matched among
many OCRs are text ones.

Under the consideration above, regions to be matched consist of text lines. So, the 
matching must be done on the basis of character string correspondence. 

By observing the outputs from printed Japanese OCRs, the problems stated below 
are noted. 

• A line is sometimes divided into multiple text regions, especially when a large space
area exists in it.

• The sequence of the text regions differs OCR by OCR. This phenomenon occurs 
notably when vertical and horizontal lines exist in a document simultaneously. (In 
Japanese documents, it is not rare to mix horizontal and vertical lines in a page.) 

• In extreme cases, column extraction fails and several text lines in different columns 
are merged into one horizontal line. 

• Captions of figures/tables, headers and footers are sometimes recognized as a part 
of another region. 

Because of these problems caused by the document image analysis, it is not easy to
match the same text regions on document images. An example of erroneously analyzed
documents is shown in Figure 1. Furthermore, in the character recognition subsystem,
deletion or insertion occurs frequently, as stated in the previous section.

3.3 Text Matching Algorithm 

In this research, text region matching is done on the basis of line similarities.

Fig. 1. An example of erroneously analyzed document images: the sequence of the regions is
not considered appropriate.

The distance between two lines is calculated by the dynamic programming (DP) method [8].
The DP matching algorithm is widely used in string matching, speech recognition and
character recognition, so it is explained only briefly here.

Denote two text lines as character strings $\{a_1, a_2, \ldots, a_m\}$ and $\{b_1, b_2, \ldots, b_n\}$. The
distance $d(i, j)$ between two character substrings $A_i = \{a_1, \ldots, a_i\}$ ($i = 1$ to $m$) and
$B_j = \{b_1, b_2, \ldots, b_j\}$ ($j = 1$ to $n$) when the two substrings are best fitted is evaluated
iteratively as follows:

initial values:

$$
\begin{cases}
d(0,0) = 0 \\
d(i,0) = d(i-1,0) + 1, & \text{for } 1 \le i \le m \\
d(0,j) = d(0,j-1) + 1, & \text{for } 1 \le j \le n
\end{cases}
$$

iteration:

$$
d(i,j) = \min
\begin{cases}
d(i-1,j-1) + h_{ij} \\
d(i,j-1) + 1 \\
d(i-1,j) + 1
\end{cases}
$$

where $h_{ij}$ denotes the distance between characters $a_i$ and $b_j$: $h_{ij} = 0$ when the two characters
are the same and $h_{ij} = 1$ otherwise. Based on the iteration, the distance $d(m, n)$ between the two total
strings $A_m$ and $B_n$ is calculated.
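
The recurrence above can be written down directly. The following is a minimal sketch that fills the full table d(i, j); the function name `line_distance` is an illustrative choice and does not come from the paper.

```python
def line_distance(a: str, b: str) -> int:
    """DP distance between two text lines, following the recurrence for d(i, j)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]

    # Initial values: d(i, 0) = d(i-1, 0) + 1 and d(0, j) = d(0, j-1) + 1.
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + 1
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + 1

    # Iteration: h is 0 when the two characters are the same and 1 otherwise.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            h = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + h,  # substitution or match
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j] + 1)      # deletion
    return d[m][n]
```

Because insertions and deletions each cost 1, two lines that differ only by a split or merged character remain close to each other, which is the property the line matching relies on.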

3.4 Construction of a Standard OCR 

Based on the distance between text lines, the distance between two text regions is calculated.
The following is a simplified explanation of the procedure.

First it is necessary to determine a “standard OCR” as the one most highly supported
by the other OCRs. The standard OCR is determined as follows. Denote two OCRs as
OCR(i) and OCR(j). In this process, all lines in all regions are collected from the OCR
outputs and numbered arbitrarily.

1. Calculate the distance d(k) between line 1 of OCR(i) and line k of OCR(j).

2. The line k-min of OCR(j) whose distance from line 1 of OCR(i) is smallest is allotted
to line 1 of OCR(i). When the smallest distance is greater than a threshold,
no corresponding line in OCR(j) is determined for line 1.

3. Similarly, for all lines in OCR(i), the corresponding lines in OCR(j) are determined.

4. For every OCR(i), the distances of all line pairs between the two OCRs are
summed and the average distance is calculated by dividing by the number of lines.

5. After the calculation of the average distance of every OCR from the others, the OCR
which has the smallest average is set as the standard OCR.
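
The steps above can be sketched as follows. This is a sketch under stated assumptions rather than the authors' procedure: the text does not say how lines without a correspondence enter the average, so here every line contributes its smallest distance, whether or not it passes the threshold. Every OCR is assumed to output at least one line, and the names `match_lines` and `choose_standard` are hypothetical.

```python
def line_distance(a, b):
    """Compact two-row version of the DP line distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), cur[j - 1] + 1, prev[j] + 1))
        prev = cur
    return prev[-1]


def match_lines(lines_i, lines_j, threshold):
    """Steps 1-3: for each line of OCR(i), find the closest line k-min of OCR(j).

    Returns (k_min or None, smallest distance) per line; k_min is None when
    the smallest distance exceeds the threshold.
    """
    matches = []
    for line in lines_i:
        dists = [line_distance(line, cand) for cand in lines_j]
        k_min = min(range(len(dists)), key=dists.__getitem__)
        matches.append((k_min if dists[k_min] <= threshold else None, dists[k_min]))
    return matches


def choose_standard(outputs, threshold):
    """Steps 4-5: average the distances per OCR and pick the smallest average.

    outputs maps an OCR name to its list of text lines.
    Assumption: unmatched lines still count with their smallest distance.
    """
    averages = {}
    for name, lines_i in outputs.items():
        dists = [dist
                 for other, lines_j in outputs.items() if other != name
                 for _, dist in match_lines(lines_i, lines_j, threshold)]
        averages[name] = sum(dists) / len(dists)
    return min(averages, key=averages.get), averages
```

With three toy outputs such as `{"A": ["abcde", "fghij"], "B": ["abcdx", "fghij"], "C": ["abzzx", "fghzz"]}`, the output that lies between the other two obtains the smallest average and is chosen as the standard.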

3.5 Judgment of Analysis Failure by Text Line Sorting 

The document analysis result of each OCR is checked by a human operator to see whether it
corresponds to human judgment. The results are compared with those of the automatic
region correspondence method.

Automatic region correspondence is done only by the text line sequence. First the 
standard OCR is constructed by the method stated in the previous section. Then the 
text lines in outputs of other OCRs are sorted according to those of the standard OCR 
using the text line similarity. 

When all lines can be sorted according to those of the standard OCR, the region 
correspondence is considered as a success. Otherwise, the analysis is considered as a 
failure. 
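
A minimal sketch of this success/failure judgment is given below. The paper does not spell out the sorting rule, so the sketch makes several assumptions: lines are assigned greedily and one-to-one, a line counts as corresponding only if its distance is within a threshold, and the result is a failure as soon as one line cannot be placed or a line is left over. The names `sort_by_standard` and `threshold` are hypothetical.

```python
def line_distance(a, b):
    """Compact two-row version of the DP line distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), cur[j - 1] + 1, prev[j] + 1))
        prev = cur
    return prev[-1]


def sort_by_standard(standard_lines, other_lines, threshold):
    """Reorder other_lines to follow the line sequence of the standard OCR.

    Returns the reordered lines on success and None on failure.
    """
    remaining = list(range(len(other_lines)))
    ordered = []
    for ref in standard_lines:
        if not remaining:
            return None  # the other OCR has too few lines
        best = min(remaining, key=lambda k: line_distance(ref, other_lines[k]))
        if line_distance(ref, other_lines[best]) > threshold:
            return None  # no sufficiently similar line: analysis failure
        ordered.append(other_lines[best])
        remaining.remove(best)
    # Leftover lines could not be sorted, which also counts as a failure here.
    return ordered if not remaining else None
```

A stricter or more tolerant failure criterion (for example, allowing split or merged lines) would change only the two `return None` conditions.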



3.6 Voting 

The method of voting on OCR results at the character level is almost the same as the one used in
Section 2.2. In this case, however, a rejection scheme is introduced.

Using the standard OCR as a criterion, all lines in another OCR can be sorted based
on the correspondence of each line before the majority logic is applied.



4 Experimental Results

4.1 Environment 

In the experiment explained in this section, the five software OCRs listed in Table 4
(from A to E) are used. OCR F is omitted because it was supplied by the same maker
as OCR E.

For the part of the documents in JEIDA’93 which were analyzed successfully by our
method, the voting method was applied to the OCR results. The performances were
almost the same as those in Table 3.

4.2 Document Samples

As in Section 2.2, JEIDA’93 was used, and 77 images were selected from the
database. In the dataset, 51 images contain figures and tables and 26 do not contain
them. Since JEIDA’93 includes very tough samples from the document analysis viewpoint,
easier samples were added. As easier samples, 50 academic papers, some of
which contain figures, were used.

4.3 Experimental Results 

Table 4 shows an experimental result of region recognition of each OCR. OCRs
A to E are the five OCRs used in the experiment, but their order has nothing to do with that
in Table 3. In this table “Failure” means that the final extraction of regions failed or
an erroneous sequence of regions was obtained.



Table 4. Failures in Region Recognition.

OCR      JEIDA’93 without Figures   JEIDA’93 with Figures   Academic Papers   Total
A                              10                       4                 4      18
B                              14                       8                 4      26
C                               7                       1                10      18
D                              14                      10                20      44
E                               9                       4                 6      19
Total                          51                      26                50     127



Table 5 shows the failures in the virtual OCR which integrates the results of OCRs
A-E by the proposed method.

From Table 4, it is evident that some OCRs are poor at region recognition.
Table 5 shows that the proposed method performs well and is promising
for simply structured documents such as academic papers.

In this experiment, the total number of characters is 4,813. In the calculation of the
recognition rate, a rejection is weighted as 1/2 of a substitution.

Table 5. Failures in Region Recognition by the Proposed Method.

OCR          JEIDA’93 without Figures   JEIDA’93 with Figures   Academic Papers   Total
Our Method                          3                       7                 2      12
Total                              51                      26                50     127



Figure 3 shows a sample of recognition results obtained by the proposed method.
Here, “R” indicates that the input character was rejected; candidates for the
rejected character are shown in braces.

Fig. 2. A region in an input image (JEIDA’93) extracted by the OCRs and successfully
matched by the proposed algorithm.



5 Concluding Remarks 

In this paper we showed that integration of multiple OCR results is useful both at the 
document analysis stage and the character segmentation and recognition stage. 

This paper proposes a simple method for integrating the faulty results of document
analysis and of character segmentation and recognition. Experimental results are
reported for commercial OCRs on printed Japanese documents, including JEIDA’93 and
academic papers.

The proposed procedure for matching the regions output by multiple OCRs seems to
work well for simply structured documents. The character recognition rate in the
successfully analyzed text regions improved from 92.1-97.6% to 98.8%, which
demonstrates the effectiveness of the proposed method.

In order to make the proposed method more robust, a more sophisticated matching
process between regions will be necessary, especially when figures and tables are
included in the documents.

The idea of integrating multiple OCRs is applicable to a wide range of areas. Some
of the authors have already reported an approach that applies a similar principle
to English cursive handwritten word recognition [9].

Fig. 3. Recognition result of the region shown in Fig. 2. “R” marks in the recognition
result show rejection errors, and the characters in braces are candidates for each
rejection.

Abstract. In this paper, we present DocMining, a general framework that allows
the construction of scenarios dedicated to document image processing. The
framework is the result of the collaboration between four academic partners and
one industrial partner. The main issues of DocMining are the description and the
execution of document analysis scenarios. The explicit declaration of scenarios
and the plug-in oriented approach of the framework make it easy to integrate
new Document Processing Units and to create new application prototypes.
Moreover, this paper highlights the value of the platform for addressing the
problem of performance evaluation.



1 Introduction 

The design of flexible and adaptable document analysis systems is a very complex
task involving the sequential ordering and the tuning of parameters for many
software components and algorithms. These components correspond to the classical
processing steps of a document understanding process, going from low-level
processing (filtering, physical segmentation, ...) to high-level understanding
techniques (layout analysis, metadata extraction, ...).

Even if some of these techniques may be considered as mature from a functional 
point of view, their integration in the context of a generic and flexible approach is a 
difficult task. The heterogeneous representation of the information and the various 
objectives of document understanding processes make the sequential ordering of these 
techniques and the tuning of their parameters a cumbersome problem. In fact, the 
intention of the users can be diverse, going from classical segmentation to perform- 
ance evaluation, through content-based image retrieval. 

S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 472-483, 2004. 

© Springer-Verlag Berlin Heidelberg 2004 

Many domain-specific systems, and more precisely many application-specific systems,
have been described in the literature [2]. Even when the applied strategies are
designed to be as generic as possible, the illustrations given for a system are
limited to a single category of documents and, moreover, do not provide any
quantitative evaluation on significant document databases.

Actually, to the best of our knowledge, no such complete and generic interpretation
system exists, because implementing an interpretation framework requires excellent
know-how. This expertise covers both document analysis techniques and
domain-specific knowledge, which is needed to adapt the analysis scenario to the
intention of the document user [2, 3].

These points highlight that many categories of knowledge are involved in such a
document analysis system, and that a good way to keep the interpretation system as
flexible as possible is to model this knowledge and to separate it from the source
code [3, 5].

Another way to tackle the difficult problem of heterogeneous document analysis is
to take a pragmatic approach, which consists in interacting with the user to define
all this knowledge and these analysis scenarios. This may appear less ambitious
than approaches that try to implement generic and fully automatic systems, but it
appears more realistic when dealing with a large category of documents, for which a
user may be interested in a large number of analysis scenarios, using many
processing toolboxes, according to specific intentions.

This paper presents DocMining, a software platform that aims at providing a general
and interactive environment for building such document analysis systems. The
project is supported by the DocMining consortium, which includes four academic
partners and one industrial partner: PSI Lab (Rouen, France), the Qgar team at
LORIA (Nancy, France), L3I Lab (La Rochelle, France), DIVA Group (University of
Fribourg, Switzerland) and GRI Lab from France Telecom R&D (Lannion, France), with
the partial funding of the French Ministry of Research, under the auspices of RNTL
(Réseau National des Technologies Logicielles). On the basis of a set of
heterogeneous components, this platform allows us to solve numerous categories of
document analysis problems, through a set of strategic aspects that are described
in the different parts of this paper.

Section 2 presents the objectives and the foundation of DocMining; section 3 de- 
scribes in detail the architecture of the DocMining platform; section 4 illustrates three 
practical examples of scenarios; finally, the last section concludes the paper and pre- 
sents future perspectives of using the platform. 



2 Objectives and Foundation 

From the final user’s point of view, DocMining appears to be a Document Analysis
System Builder, which makes it possible to define multiple document analysis
scenarios with great flexibility, on the basis of heterogeneous components and
without any specific constraints on those components. From the point of view of the
DocMining consortium itself, the platform has allowed us to interface all the
processing units coming from the libraries of each partner in order to assess very
different analysis scenarios, such as page segmentation, classification problems,
and performance evaluation.

The platform relies on three specific “concepts”:

• Dynamic scenario construction: The principle is to allow the user to interact with
the system to develop a problem-adapted scenario, which consists of an ordering of
Document Processing Units (abbreviated DPU in the rest of this paper); DPUs are
linked as a function of the user’s intentions. Furthermore, running a scenario
collects user experience, which becomes part of the scenario itself. The scenario
may then be transformed into a new DPU corresponding to a higher-level granularity.
Here, we can find similarities with electronic circuit design tools, such as VHDL
for instance.

• Document-centered approach: All the knowledge (data structures and DPU
parameters) that is extracted during scenario execution is “archived” in the
document. Its structure evolves as a function of the users’ intentions and
interactions. Thus, the document actually appears as a communication channel
between the DPUs that are run by the users. Such an approach avoids the problems of
data scattering usually met in classical document processing chains.

• A plug-in oriented architecture: The principle is to permit the integration of hetero- 
geneous software components. As a consequence, developers can conveniently add 
new processing units, thus making the platform easily upgradeable. Document 
visualization and manipulation tools are also designed according to this approach, 
so that a user is able to fully customize the interactions with the document structure. 

As a consequence, DocMining is a platform that makes it possible to solve new
problems quickly and easily, on the basis of a set of heterogeneous processing
tools. But the advantages of this approach go beyond the mere simplicity of
implementing an image processing chain, since the document-centered approach
enables users to share knowledge about parameter tuning, results and analysis
scenarios. This advantage is particularly important in the context of collaboration
between different research teams, such as in the DocMining Consortium. As two
practical examples, the DocMining platform offers features that make it easy to
compare different document understanding strategies and to share know-how about
the use of a particular processing tool.

Actually, DocMining does not aim at representing the different categories of 
knowledge that are implicitly involved in a document understanding process, but of- 
fers a pragmatic environment helping the user to interactively and explicitly integrate 
this knowledge: 

• Document-specific knowledge, including the representation of the objects,
structural/syntactic information, thresholds for image processing techniques, ...

• Strategic knowledge, which deals with the processing chain (scenario), built when 
solving a particular problem. 

• Processing knowledge with parameter tuning, which remains a very difficult prob- 
lem when considering a new category of document, in a specific context. 

The DocMining platform has been tested on various problems (document segmentation,
document vectorization, pattern recognition, performance evaluation, ...) and with
different kinds of documents (structured documents, graphic documents, databases,
...). It uses various software components stemming from the respective processing
libraries of each partner in the DocMining Consortium (DIUF’s xmillum framework,
the Qgar library, the PSI library, ...). Some of these experiments are presented in
section 4. The following section presents the architecture of the DocMining
platform.




Fig. 1. Overview of the DocMining platform architecture. 



3 DocMining Platform Architecture 

3.1 Architecture Overview 

The concepts presented in section 2 have led to the design of a platform for building
and applying document analysis scenarios. This platform, called DocMining, relies on
the architecture illustrated in figure 1. The platform contains five main components:

• The document itself, comparable to a blackboard containing the current extracted 
data. 

• An extensible set of Document Processing Units, which act like the “specialists”
in the blackboard paradigm.

• A scenario, containing both DPUs and fine-grained components designed for various
purposes, such as data adaptation between the document and a DPU, parameter
observation, DPU repetition, ...

• A scenario engine called ImTrAc that interprets specific elements of a scenario
and controls DPU execution and document updates.

• A set of user interfaces, assisting the construction of new scenarios and visualizing 
intermediate and final results. 

3.2 The Document 

Our approach is focused on the document, which can be considered, as a first
approximation, as a set of graphical objects. The document centralizes the major
part of the information involved in the processing and is progressively enriched by
the DPUs. In order to store all this information, documents are represented by an
XML tree that describes the graphical objects. The XML format guarantees the user
the possibility of extending the set of already-defined objects. The DocMining
consortium defined an XML tree structure standardizing the graphical objects. The
tree is specified in an XML schema (figure 2a), which defines some primitive
elements that can appear in the document (GreyImage, BinaryImage, Bloc,
TextLine, ...), their attributes (position in the image, for example), and their
source. The schema is not exhaustive and can evolve with new data types, according
to user needs. It is associated with a Java API, proposed in order to manage the
objects of the structure. This library enables us to load, create and save object
instances.
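
As an illustration of such a document tree, the following minimal sketch builds and
enriches a small XML document. It is written in Python with the standard
`xml.etree.ElementTree` module rather than with the platform's Java API, which is not
shown in this paper. The element names GreyImage, Bloc and TextLine come from the
text above; the root element name, the attribute names and the helper functions are
assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_document(image_path):
    """Create a document tree holding a reference to the input image.
    The "Document" root and the attribute names are hypothetical."""
    doc = ET.Element("Document")
    ET.SubElement(doc, "GreyImage", {"href": image_path, "source": "scanner"})
    return doc


def add_text_line(parent, x, y, width, height, source):
    """Enrich the tree with a TextLine, recording its position and its source."""
    attrs = {"x": str(x), "y": str(y), "width": str(width),
             "height": str(height), "source": source}
    return ET.SubElement(parent, "TextLine", attrs)


doc = build_document("page-001.png")
bloc = ET.SubElement(doc, "Bloc", {"x": "40", "y": "60", "width": "500",
                                   "height": "120", "source": "segmenter"})
add_text_line(bloc, 40, 60, 500, 30, source="line-finder")
add_text_line(bloc, 40, 95, 500, 30, source="line-finder")

# Serialize the enriched document so that it can be saved and reloaded later.
print(ET.tostring(doc, encoding="unicode"))
```

Because every object carries its position and its source, a later processing step
can select exactly the elements it needs without any side channel between units.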

The next sub-section shows how to define a DPU, the main issue for analyzing and 
enriching a document. 

Fig. 2. XML schemas defining the structure of a document (a) and the scenario structure (b).

3.3 Document Processing Unit

One objective of the DocMining consortium is to enable interoperability between the
different processing libraries proposed by research teams. In this context, five
main processing libraries have already been included in the platform: the Qgar
library, the PSI library, Java Advanced Imaging (JAI), a DocMining classification
library and a DocMining feature extraction library. These libraries contain various
components, such as image processing tools, classification tools,
structural/syntactical operators, and so on. The goal of this sub-section is not to
describe the libraries already included in the platform [2], but to illustrate how
to integrate new DPUs.

In order to be included in the platform, a unit has to respect two contracts. The
first one is a declarative contract that describes four elements:



• The offered service, describing effects of the DPU on the XML tree (creation, 
modification or fusion of data). 

• The target data of the DPU. This parameter defines the conditions (i.e. the XML
elements that must be present in the document structure) that allow the DPU to be
triggered.

• The DPU parameters characterized by their name, their type, their support, their 
possible values and their default value. 

• The resulting objects. 

Fig. 3. XML schema for the contract definition.



The contracts must respect the XML schema in figure 3. Thus, the knowledge
concerning DPUs can be made explicit. Such knowledge encapsulates both the
behavioral and operational aspects. In our view, such a model of processing is
oriented towards explicit knowledge in document analysis and library sharing.

The second contract is software oriented; it consists of a set of Java interfaces that 
direct the user in the encapsulation of algorithms into a DPU. 
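
The two contracts can be pictured with a short sketch. It is written in Python for
brevity, whereas the platform expresses the declarative contract in XML and the
software contract as Java interfaces; all class, field and method names below
(`Contract`, `Parameter`, `DPU`, `Binarizer`, `is_triggerable`, `run`) are
hypothetical and do not reproduce the DocMining API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Parameter:
    name: str
    type: str
    default: object
    support: str = ""
    possible_values: tuple = ()


@dataclass
class Contract:
    """Declarative contract: service, target data, parameters, resulting objects."""
    service: str                 # "creation", "modification" or "fusion"
    target: str                  # path of the elements that allow triggering
    parameters: list = field(default_factory=list)
    produces: tuple = ()         # names of the resulting objects

    def is_triggerable(self, document):
        """The unit may run only if its target elements are present."""
        return bool(document.findall(self.target))


class DPU(ABC):
    """Software contract: the interface that every processing unit implements."""
    contract: Contract

    @abstractmethod
    def run(self, element, **params):
        """Process one target element and return the list of new elements."""


class Binarizer(DPU):
    contract = Contract(
        service="creation",
        target=".//GreyImage",
        parameters=[Parameter("threshold", "int", 128)],
        produces=("BinaryImage",),
    )

    def run(self, element, threshold=128):
        # A real unit would process pixels; this sketch only updates the tree.
        return [ET.Element("BinaryImage", {"threshold": str(threshold)})]


doc = ET.fromstring("<Document><GreyImage href='page.png'/></Document>")
unit = Binarizer()
if unit.contract.is_triggerable(doc):
    for target in doc.findall(unit.contract.target):
        doc.extend(unit.run(target))
```

Keeping the declarative part separate from the code is what lets a tool list the
applicable units for a given document without executing any of them.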

A sequence of DPUs is involved in the construction of the scenario; its structure is 
presented in the next sub-section. 



3.4 The Scenario Structure 

The notion of scenario is another main issue of the DocMining platform. A scenario
consists of a set of elementary actions that are run sequentially on part of the
objects in the document. The DPUs presented in the previous sub-section are not the
only authorized actions: the scenario includes further functionality, allowing us to
integrate heterogeneous software components and to offer a “high” programming level
to final users:

• Triggering. A trigger is run automatically at each step of the scenario. For
instance, a trigger makes it possible to isolate intermediate information related to
the running DPU: processing time measurements, tracking of global variables, ...

• Support adapter. It provides the capability of adapting a particular dataset to a DPU. 
The support adapter works as an interface between DPU and dataset. 

• Instruction. This notion corresponds to code that is interpreted, thus offering the
user simple scripting capabilities.

• Variables, allowing the exchange of data between the other components provided
by the scenario.

The XML schema illustrated in figure 2b declares the concepts that define a valid
scenario. The scenario engine, presented in the next sub-section, interprets these
concepts.

3.5 The Scenario Engine 

The scenario engine, called ImTrAc, interprets and coordinates a well-defined
scenario through the sequential ordering of available DPUs. It acts as a mediator
between the document and the DPUs: it parses the scenario and schedules the units
involved. First, ImTrAc loads the document and the scenario; then, it checks their
validity with respect to the XML schemas. After instantiating the variables and
components, it analyzes each scenario step, starting with the detection of the
activation nodes in the XML tree. A sequence of instructions consisting of
pre-processing, DPU, post-processing and triggers is then executed. This sequence
may be repeated many times in the same step. Finally, ImTrAc searches for the next
step defined in the scenario and repeats the operations described above. The full
execution of the scenario produces new data, thus enriching the document. The next
sub-section presents the interfaces involved in the creation and validation of
scenarios.
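
The control flow described above can be summarized by the following sketch. It is
not the ImTrAc implementation: the step format (a dictionary with `name`, `select`,
`dpu` and optional `pre` and `post` entries), the trigger signature and the function
name are assumptions, and schema validation is omitted because the Python standard
library does not provide it.

```python
import time
import xml.etree.ElementTree as ET


def run_scenario(document, steps, triggers=()):
    """Apply each step to its activation nodes, in scenario order."""
    for step in steps:
        # Detect the activation nodes of this step in the XML tree.
        nodes = document.findall(step["select"])
        for node in nodes:
            # The pre/DPU/post/trigger sequence is repeated for every node.
            start = time.perf_counter()
            data = step.get("pre", lambda n: n)(node)
            produced = step["dpu"](data)
            produced = step.get("post", lambda p: p)(produced)
            node.extend(produced)              # enrich the document
            elapsed = time.perf_counter() - start
            for trigger in triggers:
                trigger(step["name"], node, elapsed)
    return document


def find_lines(bloc):
    """Stand-in processing unit: returns one TextLine per block."""
    return [ET.Element("TextLine", {"source": "find_lines"})]


log = []
doc = ET.fromstring("<Document><Bloc/><Bloc/></Document>")
run_scenario(
    doc,
    steps=[{"name": "lines", "select": ".//Bloc", "dpu": find_lines}],
    triggers=[lambda name, node, elapsed: log.append((name, node.tag))],
)
```

Here the trigger only records which step ran on which node; measuring `elapsed` in
the engine rather than in each unit keeps timing independent of the units.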

3.6 User Interface 

Two user interfaces have been integrated in the DocMining platform (figure 4). The
first one, called SCI (Scenario Construction Interface), aims at providing
user-friendly assistance for building new scenarios. The second one, called
xmillum, is a framework for visualizing and interacting with the produced data.
Although the two user interfaces can be used separately, they communicate with each
other, allowing the user to watch, correct and tune intermediate results of the
scenario construction process.

3.6.1 The Scenario Construction Interface 

The SCI tool allows the user to build new scenarios. For a given element of the
document structure, the SCI tool is able to supply the list of DPUs that may be
applied. After the user chooses a unit, the SCI tool supplies its parameter list so
that the corresponding process can be launched. The SCI tool has two modes for DPU
execution:

• The emulation mode checks the contract in order to determine which object is
produced by a DPU. The document is updated according to the provided service and
the produced object declared in the contract.

• The execution mode actually applies the DPU to the selected elements by calling 
the ImTrAc engine. 

After each processing step, the document structure is updated and the user can in- 
teract with the newly created object. During the entire scenario construction, the SCI 
tool ensures the coherence and the validity of its structure. 

Fig. 4. Scenario construction with SCI and results visualization with xmillum. SCI is
used to build new objects and xmillum to reorganize the produced objects.



3.6.2 xmillum 

xmillum is a framework for cooperative and interactive analysis of documents. It of- 
fers an interface that uses external modules to visualize and to interact with a docu- 
ment. The framework has been developed to work with documents represented in 
XML. The aims of xmillum in the DocMining platform are: 

• Visualization of documents. The interface of xmillum uses a background layer to 
represent the original document and semitransparent layers to visualize the new 
views, which contain the result of the DPUs. 

• Interaction with the document. The user can interact with xmillum in order to
manipulate the original document or the extended views of it. Interaction is useful
for validating or correcting the results of DPUs. Another use of interaction is the
definition of ground-truthed data, which makes it possible to parameterize the DPUs
defined in the scenario chain.

The next section presents examples of interactive analysis of documents. 



4 Three Performance Evaluation Use-Cases 

The DocMining platform has been tested on various problems dealing with different
documents, using various software components from the respective processing
libraries of the DocMining consortium partners, or from other contributions. In this
section, we describe three practical cases: a segmentation evaluation scenario, a
pattern recognition evaluation scenario, and a text-graphics separation scenario.

4.1 Segmentation Evaluation Scenario

Performance evaluation of algorithms has become a major challenge for document 
analysis systems. In order to choose the most valuable algorithm according to the 
domain or to tune algorithm parameters, users must have evaluation scenarios at their 
disposal. However, efficient performance evaluation can only be achieved with a
representative ground-truth dataset. This application tackles both aspects of
performance evaluation: the construction of a ground-truthing scenario and the
definition of a segmentation evaluation procedure.

4.1.1 Ground-Truth Dataset Construction Scenario 

PDF documents serve as basis for our ground-truth dataset. Indeed, the PDF format is 
widely used in many applications (newspaper, advertising, slides, ...) and PDF docu- 
ments can be easily found on the web. Moreover search engines enable the user to 
refine a search according to the document format. So it is very easy to build a PDF 
document base where many domains are represented. A ground-truth dataset can also 
be built with newly created PDF documents based on the transformation of XML 
documents. With these two approaches, we can build a document base which contains 
“real life” documents obtained through an internet search and “problem specific” 
documents built from an XML source. Nevertheless, the PDF format has a major 
drawback: It is based on a pure display approach, so that structural and logical infor- 
mation is not directly accessible; this information must be computed from the low 
level objects contained in the PDF document. So we built a scenario that partially 
constructs the physical structure of a document. The ground-truth dataset is obtained 
through a three step scenario: 

• Select the PDF document. 

• Parse the PDF document and partially extract the physical structure. We developed
a DPU, based on the PDF parsing API of the Multivalent package [6], that extracts
the content and location of letters, words and text lines. Additional elements such
as images are also extracted.

• Save the generated ground-truth structure. The document structure is saved
according to the XML schema we defined.

4.1.2 Benchmarking Scenario 

The second part of the application consists of evaluating segmentation algorithms for 
the raw dataset. This benchmarking scenario is composed of three steps: 

• Transformation of the PDF document into an image. The DPU we designed
encapsulates a Ghostscript command.

• Physical structure extraction using page segmentation. At this time, we have two
segmentation algorithms, one based on a classical top-down approach and the other
based on a hybrid approach [4].

• Segmentation performance evaluation. Segmentation algorithms produce a resulting
XML structure, which is matched with the ground-truth dataset to measure the region
overlap ratio. Ground-truth information is extracted from the dataset by using
XPath expressions, which make it possible to select the desired corresponding
structure. XPath expressions give the user great flexibility to select exactly what
is needed. The user can modify these selection expressions by choosing another kind
of object (words, for example) or by adding constraints (for example, small areas
may be filtered out). Node matching itself is done with Yanikoglu’s method, based on
the ON pixels contained in a zone [7]. In order to ignore insignificant differences
between the ground-truth regions and the segmented ones, only the black pixel
content of the areas is taken into account.
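
The reason for counting only black pixels can be seen in a small example. The sketch
below is a simplified illustration, not Yanikoglu’s method [7]: it assumes
axis-aligned rectangular zones and a binary image stored as a list of rows, and the
function names are hypothetical.

```python
def on_pixels(image, box):
    """Count ON (black) pixels of a binary image inside box = (x0, y0, x1, y1),
    with x1 and y1 exclusive."""
    x0, y0, x1, y1 = box
    return sum(image[y][x] for y in range(y0, y1) for x in range(x0, x1))


def intersection(a, b):
    """Return the overlapping box of a and b, or None if they are disjoint."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None


def overlap_ratio(image, truth_box, found_box):
    """Share of the ground-truth zone's ON pixels that fall inside found_box."""
    total = on_pixels(image, truth_box)
    common = intersection(truth_box, found_box)
    if total == 0 or common is None:
        return 0.0
    return on_pixels(image, common) / total


# 4 x 6 binary image; 1 marks an ON pixel.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
truth = (0, 0, 6, 4)   # ground-truth zone including white margins
found = (1, 1, 4, 3)   # tighter segmented zone around the same ink
print(overlap_ratio(image, truth, found))  # 1.0: the white margin is ignored
```

An area-based ratio would penalize the tighter zone for the missing margin, although
both zones contain exactly the same ink.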

Although many page segmentation evaluation problems are not yet addressed in this
experiment, DocMining shows its ability to tackle many aspects of ground-truthing
and benchmarking. Its modularity can help build such a scenario according to users’
needs. Moreover, it allows users to design their own performance evaluation
algorithm. For more information about this application, see [1].

4.2 Pattern Recognition Evaluation Scenario 

This second application deals with recognition on the “chicken dataset”, composed
of 5 classes and provided by the IAPR TC5. This dataset is composed of binary images
associated with features corresponding to contour coding. As in the first use-case,
the goal is not to demonstrate the superiority of an approach, but to show the
advantage of using the DocMining platform to compare the results of various
approaches. In this context, additional tools for feature extraction
(Fourier-Mellin invariants from the PSIlib, Hu moments from the OpenCV library) and
classification (API from PSIlib) have been implemented, which demonstrates the
interoperability of the platform. The scenario is built in three parts:

• The knowledge base construction. The knowledge base is first built from correctly
labeled images from the dataset. References to these images are included in the
document at the first step of the scenario. Then, each symbol of the dataset is
associated with its contour coding. This step is done through a specific DPU that
constructs a structured description of all the symbols, using XPath expressions.

• The new features addition part, where the symbol description is completed with a
new set of features (Fourier-Mellin invariants and Hu moments). A new specific DPU
is applied to enlarge the feature database describing the symbols. It can address
each image referenced in the document and simply add new structured information to
its description as an XML fragment. Although the feature data grow, manipulating
them does not become more difficult, because the document is the only entry point
for accessing them. The knowledge base is then complete.

• The classification part, which consists in creating learning and test bases from
the dataset, and classifying samples from the test base. The bases are randomly
generated and their contents are simply labeled, without physically splitting the
knowledge base. The samples of the test base are then classified using a k-NN
operator, and the recognition process is finally scored. The data formats are
automatically adapted to the processing steps thanks to XPath expressions.
Performance evaluation consists in comparing the recognition results with the input
labels of the symbols.
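
The classification part can be sketched as follows. This is a minimal illustration
under stated assumptions, not the PSIlib classifier: samples are plain
(feature vector, label) pairs with toy two-dimensional features instead of contour
codes, Fourier-Mellin invariants or Hu moments, and all function names are
hypothetical.

```python
import math
import random
from collections import Counter


def split_labels(n_samples, test_fraction, seed=0):
    """Mark each sample as 'learn' or 'test' at random; no data is moved or copied."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_fraction))
    test_ids = set(rng.sample(range(n_samples), n_test))
    return ["test" if i in test_ids else "learn" for i in range(n_samples)]


def knn_classify(query, learn_set, k=3):
    """Majority vote among the k nearest (features, label) pairs."""
    nearest = sorted(learn_set, key=lambda s: math.dist(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def recognition_score(samples, roles, k=3):
    """Fraction of test samples whose predicted label equals the input label."""
    learn = [s for s, r in zip(samples, roles) if r == "learn"]
    test = [s for s, r in zip(samples, roles) if r == "test"]
    hits = sum(knn_classify(f, learn, k) == label for f, label in test)
    return hits / len(test) if test else 0.0


# Toy data: two well-separated classes of five samples each.
samples = ([((0.1 * i, 0.0), "a") for i in range(5)]
           + [((10.0 + 0.1 * i, 0.0), "b") for i in range(5)])
roles = split_labels(len(samples), test_fraction=0.3)
print(recognition_score(samples, roles))
```

Labeling the samples instead of copying them mirrors the idea that the knowledge
base stays in one place and only the role of each entry changes between runs.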

In this application, the whole classification chain has been implemented using an
image dataset and some pre-defined features. We have shown in this section that it
is quite easy to enlarge the feature set by applying other feature extraction tools
and various specific processing steps. The new features can be in various
representation formats, because they can be adapted to the classification tool.

4.3 A Scenario for Mixed Text/Graphics Documents 

The third application that is presented here performs the analysis of documents with 
mixed text/graphics content. Its aim is the separation of the graphical part from the 
textual part. The steps of the scenario are ordered as follows: 

• Binarize the image, 

• Perform text-graphics separation, 

• Perform segmentation into text blocks on the textual layer image, 

• Perform segmentation into connected components on the graphical layer image, 

• Perform segmentation into connected components on the parts that have not been 
classified as graphics or text, 

• Correct text-graphic separation errors by a cross-examination of the results of these 
three segmentations, 

• Perform OCR on the text blocks of the resulting textual layer image, 

• Perform vectorization on the resulting graphical layer image. 

The main DPUs available to run the scenario are described in [2], and the
construction of this scenario relies on a structural combination of these DPUs. This
has been made possible by associating each processing step with a contract
describing its behavior. This contract includes processing instructions as XPath
expressions, which are interpreted by the ImTrAc engine. ImTrAc extracts the
resulting element set from the document and transmits it to the processing step.
Such a scenario may be used for different purposes. It may become part of a more
general document interpretation system. It may also be used to evaluate the
robustness of an algorithm in the case of noisy input images (leading to
text-graphics separation errors). Finally, it may be used as a benchmarking tool:
when a developer implements a particular step of the scenario, he may run the
different available DPUs to evaluate the efficiency of his implementation.



5 Conclusion 

DocMining is a multi-purpose platform characterized by three major aspects. First,
its architecture relies on a document-centered approach. Document processing units
(DPUs) communicate through the document itself; such an approach avoids the
problems of data scattering usually met in classical document processing chains.
Second, the DocMining framework is based on a plug-in oriented architecture.
Developers can conveniently add new DPUs, thus making the platform easily
upgradeable (for example, in order to process color documents). Document
visualization and manipulation tools are also designed according to this approach,
so that a user is able to fully customize the interactions with the document
structure. Third, the platform handles scenario-based operations. Running a scenario
collects user experience, which becomes part of the scenario itself. The scenario
may then be transformed into a new DPU, corresponding to a higher-level granularity.
Thus, the DocMining architecture is truly modular, because users can create their
own objects, integrate their own DPUs into the platform, design their own
interfaces, and define and run their own scenarios.

More than a software environment, DocMining must therefore be considered as a
general framework for integrating various tools, DPUs and systems. The aim of the
framework is to be flexible and general enough for building various kinds of
document analysis systems (documents can be black and white, grey or color); indeed,
we have shown that it can even be extended to more general pattern recognition
systems.

We are especially convinced of the usefulness of the DocMining framework for
setting up and running performance evaluation and benchmarking campaigns or
contests. As shown in the use-cases presented in this paper, it is easy to build
scenarios that take ground truth into account, compute performance according to
some metric, or even match the recognized entities with the ground truth; in this
case, the matching is simply another DPU to be integrated in the chain. Work remains
to be done on defining a general “roadmap” for building such evaluation campaigns,
but we hope to be able to offer the power of our software framework, in the coming
months and years, for the benefit of various performance analysis campaigns within
the pattern recognition community.
Abstract. We present a system for automatic FAX routing which pro-
cesses incoming FAX images and forwards them to the correct email alias.
The system first performs optical character recognition to find words and
in some cases parts of words (we have observed error rates as high as 10 
to 20 percent). For all these “noisy” words, a set of features is computed 
which include internal text features, location features, and relationship 
features. These features are combined to estimate the relevance of the 
word in the context of the page and the recipient database. The pa- 
rameters of the word relevance function are learned from training data 
using the AdaBoost learning algorithm. Words are then compared to the 
database of recipients to find likely matches. The recipients are finally 
ranked by combining the quality of the matches and the relevance of the 
words. Experiments are presented which demonstrate the effectiveness 
of this system on a large set of real data. 



1 Introduction 

Companies may receive hundreds or thousands of FAXes per day. While many 
are printed by a conventional FAX machine, a growing number will arrive at 
computers equipped with FAX modems or through an internet FAX service. 
One natural mechanism for delivering these FAXes is as email with an attached 
FAX image file (such as TIFF). Incoming FAX images lack digital information 
about the destination email address (though they may include a small amount 
of digital information about the sender). Routing the FAX to the correct email 
address must be performed by hand, by a FAX secretary who examines each
FAX image. Though the cost of a digital FAX system is significantly less than a 
paper FAX system, the time required for routing FAXes in a large organization 
can lead to significant costs. 

This paper describes an automatic system for routing a FAX to a set of 
incoming addresses. The process proceeds in several steps: the text on the FAX is 
recognized using an optical character recognition algorithm, the text is examined 
to “spot” candidate words which are likely to be relevant to the addressee’s name, 
and finally all relevant candidate words are matched to the database of email 
addresses to determine a set of likely matches. 

2 The FAX Routing Task 

We begin with a description of the task and some observations about typical 
FAXes. 



S. Marinai and A. Dengel (Eds.): DAS 2004, LNCS 3163, pp. 484—495, 2004. 
1. We are given a database of email addresses which also includes the first and
last name, and optionally a “full name” which is sometimes different from 
either the first or last name (i.e. the full name can be used to disambiguate 
people with the same name, and it is used for mailing lists such as “Pur- 
chasing Department” or “Legal Affairs”). In our experiments this database 
included 15,000 recipients. 

2. Incoming FAXes come from a very wide range of sources. The quality of 
these documents is highly varied: 

(a) Most have a structured cover sheet. 

(b) A number are “FAXed back” and actually have the FROM and TO fields 
reversed (they are intended to be returned to the sender of the original 
FAX). 

(c) About 40% have handwritten recipient information. 

(d) Low image quality can lead to Optical Character Recognition (OCR) 
error rates greater than 20%. 

(e) The format of printed cover sheets is often tabular. Since OCR algo- 
rithms have difficulty with this type of organization, the extracted doc- 
ument structure (words/lines/paragraphs) is often unreliable. 

As mentioned above, in our experiments about 40% of the FAXes contain 
cover sheets on which the addressee’s name is written by hand. In this case 
conventional OCR cannot be used. On this point we make two observations. In a 
majority of these cases the cover sheet is a printed document. The printed words 
on the cover sheet can be used to predict the location of the addressee name, 
even if the name cannot be recognized. In addition we are currently working to 
develop and then integrate a handwritten text recognition system. In this paper
we will assume that the recipient information is printed and we will 
limit discussion to these FAXes. 

At first it may seem as though the FAX routing problem is easy. Our first 
approach was simple and entirely ad-hoc. The successes and failures of this 
system motivated our current effort. In this initial approach, the OCR text 
stream was searched to find anchor keywords such as “To” or “Attention”. The
immediately following text was then compared to the email address database to
find matches with either the email address or the name. Using this technique we
were able to correctly route approximately 52% of the FAXes received. While it
was possible to propose alternative email hypotheses, this system did not provide
an estimate of the probability that the given answer was correct.
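
The anchor-keyword baseline described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the function names, the two-word window after the anchor, and the toy database are all assumptions made for the example.

```python
# Hypothetical anchor list; the text names only "To" and "Attention".
ANCHORS = ("to", "attention")

def words_after_anchor(ocr_text, window=2):
    """Return the `window` words that immediately follow the first anchor keyword."""
    tokens = ocr_text.split()
    for i, token in enumerate(tokens):
        if token.rstrip(":").lower() in ANCHORS:
            return tokens[i + 1 : i + 1 + window]
    return []

def route(ocr_text, database):
    """Return the alias whose first and last name both follow the anchor, else None."""
    following = {w.lower() for w in words_after_anchor(ocr_text)}
    for alias, (first, last) in database.items():
        if {first.lower(), last.lower()} <= following:
            return alias
    return None

# Toy database (assumed): alias -> (first name, last name).
database = {"bob_smith": ("Bob", "Smith"), "jim_jones": ("Jim", "Jones")}

print(route("To: Bob Smith From: Jim Jones", database))  # anchor directly precedes the name
print(route("To: From: Bob Smith Jim Jones", database))  # anchor separated from the name
```

The first call finds bob_smith; the second returns None because the OCR stream places “From:” between the anchor and the name, which is the kind of failure that motivates the statistical word spotter.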



To: Bob Smith                    To:                 Bob Smith
From: Jim Jones                  From:               Jim Jones



Fig. 1. On the left is a “good” address, which after OCR is easy to interpret. On the 
right is a more difficult address block, which after OCR requires additional geometric 
processing. 
We found several sorts of problems with the above system:

1. Missing anchor issues: 

(a) The anchor word was unusual (e.g. “care of”). 

(b) The OCR text stream separated the anchor word from recipient infor-
mation (e.g. Figure 1).

(c) The anchor word is missing entirely. 

(d) The anchor word was corrupted by OCR errors. 

2. Poor name matches: 

(a) Name misspelled. 

(b) Name corrupted. 

(c) Name incorrect (e.g. last name is hyphenated in database but on FAX 
it is one of the two names). 

We address the first problem in part by building a robust and flexible sta- 
tistical word spotter that assigns a relevance score to each word on the page. 
Words with high score are more likely to contain relevant recipient information. 
The word spotter is a combination of simple word features, which is learned from 
example FAX data. The word spotter can often spot recipient information when 
all of the above “missing anchor” issues arise. In a large collection of FAXes the 
correct recipient information can be spotted more than 95% of the time. 

We address the second problem by using an efficient yet robust string match- 
ing algorithm. In addition several heuristics are used to deal with matching parts 
of names. 

Finally, information from throughout the FAX is integrated to collect the 
most robust possible evidence. So if the FAX starts out with a partially corrupt 
address block (e.g. “Te: Bob Smuth”), and the body of the FAX contains “Dear
Mr. Smith”, evidence is integrated to conclude that the FAX is for “Bob Smith”. 

3 Related Work 

In their paper, Lii and Srihari show that the “address block” of a FAX can 
be extracted using keywords such as TO and ATTENTION [1]. The keywords
themselves are found using two heuristics: that they are terminated by a colon
or that they are preceded by a large white space. The position of the keywords,
and surrounding words, are used to find the rectangular region which contains 
the address block. 

The FAXAssist system routes FAXes by matching all words in the document 
to the recipient database using “string edit distance” [2]. The full names in the 
database are processed to yield common forms such as LastName, FirstName
and FirstInitial. LastName. The match score for each word is further modi-
fied using a model for the likely positions of the recipient name.

The name extraction program of Likforman-Sulem, Vaillant, and Yvon uses 
a collection of features to represent each word on the page [3]. These features are
both internal and external. Internal features include tests to see if the word is 
capitalized, a common word, a common first name, etc. External word features
are defined by decomposing the document into blocks. Those blocks which are
near words from the recipient class such as TO and ATTENTION are labeled
as potentially being a “recipient block”. The sender block is detected similarly.
Words in the sender block have the “sender block feature” (likewise with the 
recipient block). The overall set of features is combined linearly using a set of
hand-chosen weights.



4 Recipient Information Spotting 



The key component of the recipient spotting process is a word scoring function 
which assigns a score to each word on the FAX page. Machine learning is used 
to train this function to minimize the error evaluated on a large set of training 
examples. The scoring function is expressed as a sum of simpler binary word 
functions which depend on: i) the text of the word; ii) the location of the word; 
and iii) the spatial relationship to other words. Each feature function has the 
following form: 



$$f_j(w) = \begin{cases} \alpha_j & \text{if the feature is true} \\ \beta_j & \text{otherwise} \end{cases} \qquad (1)$$



Typical examples of binary word functions include: 



- Is the word equal to the string “Mr.”?
- Does the word include the substring “.com”?
- Is the word more than 7 inches from the top of the page?
- Is the word within 0.5 inches of the word “Attention”?
- Is the distance to the nearest word greater than 1 inch?

The final word score is computed as $\sum_j f_j(w)$.
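
The additive form of the scoring function can be illustrated with a short sketch. The `Word` and `BinaryFeature` types and the vote values below are hypothetical; in the system described here the votes are learned from data, not set by hand.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Word:
    text: str
    x: float  # inches from the left edge of the page (assumed units)
    y: float  # inches from the top edge of the page (assumed units)

@dataclass
class BinaryFeature:
    test: Callable[[Word], bool]
    alpha: float  # vote when the test is true
    beta: float   # vote when the test is false

    def __call__(self, w: Word) -> float:
        return self.alpha if self.test(w) else self.beta

def word_score(w, features):
    """Sum of the binary feature votes, i.e. sum_j f_j(w)."""
    return sum(f(w) for f in features)

# Illustrative votes only; they are not taken from the paper.
features = [
    BinaryFeature(lambda w: w.text == "Mr.", alpha=1.0, beta=-0.1),
    BinaryFeature(lambda w: ".com" in w.text, alpha=0.8, beta=-0.2),
    BinaryFeature(lambda w: w.y > 7.0, alpha=-2.0, beta=0.3),
]

print(word_score(Word("Mr.", 1.0, 1.5), features))   # 1.0 - 0.2 + 0.3
print(word_score(Word("page", 1.0, 9.0), features))  # -0.1 - 0.2 - 2.0
```

Because every feature contributes either its true vote or its false vote, evaluating a word costs one test per selected feature.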

Clearly there are many potential binary word functions; we propose a pro- 
grammatic technique for generating a large set of these combinatorially. In our 
experiments more than 2000 of these simple binary features are generated before 
learning is used to select a small set of critical features and to estimate $\alpha_j$ and $\beta_j$.

Many of the binary word features are based on an underlying continuous 
word filter (the term “filter” is chosen to emphasize the continuous nature of the 
response). Examining the features listed above, three are based on filters with 
the addition of a threshold: 



- Distance of the current word from the top of the page.
- Distance of the current word from the word “Attention”.
- Distance of the current word to the nearest word.

A very large number of filters and features are generated combinatorially 
from training data and a few basic principles: 
1. Word text features:

(a) One feature is generated for each commonly occurring word in the train- 
ing database. The feature is true if the FAX word matches this word. 

(b) A single binary feature is generated from all of the words in the email 
database. The feature is true if the FAX word matches one of these alias 
words. 

(c) The presence of a common substring from the training data (e.g. “.com” 
or 

2. Location filters: X location, Y location, width and height of the bounding box.

3. Relationship features: true if a common word is within a given distance D,
which may be 200, 300, or 400 pixels.

(a) True if word is within D and directly to the left. 

(b) True if word is within D and directly right. 

(c) True if word is within D and either left or above. 

4. Relationship filters: 

(a) Distance to the nth nearest word. 

(b) Distance to the nearest word which matches a common string 

(c) The number of words on the current line. 

Using the training database to select the 200 most common words and ex- 
panding each of the above principles leads to 2128 filters and features. 
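
The combinatorial expansion can be sketched by enumerating feature descriptors. The descriptor names, the set of directions, and the range of n for the nth nearest word are assumptions; substring features are omitted, and the sketch makes no attempt to reproduce the exact count of 2128.

```python
from itertools import product

def generate_feature_specs(common_words, distances=(200, 300, 400)):
    """Enumerate (kind, parameters...) descriptors for candidate features and filters."""
    specs = []
    # 1(a): one text feature per common word.
    specs += [("text_equals", w) for w in common_words]
    # 1(b): a single feature testing membership in the alias database.
    specs.append(("in_alias_database",))
    # 2: location filters on the bounding box.
    specs += [("location", name) for name in ("x", "y", "width", "height")]
    # 3: relationship features, one per (word, distance, direction) triple.
    directions = ("any", "left", "right", "left_or_above")
    specs += [("near", w, d, direction)
              for w, d, direction in product(common_words, distances, directions)]
    # 4: relationship filters.
    specs += [("dist_to_nth_nearest", n) for n in (1, 2, 3, 4)]
    specs += [("dist_to_nearest_match", w) for w in common_words]
    specs.append(("words_on_line",))
    return specs

specs = generate_feature_specs(["To", "From", "Attn"])
print(len(specs))  # 3 + 1 + 4 + 36 + 4 + 3 + 1 descriptors
```

Each filter descriptor still has to be paired with a threshold to become a binary feature; in the approach described here that choice is left to the learning algorithm.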

The AdaBoost machine learning framework is used both to select a subset
of the available features, including the filter thresholds, and to estimate the feature
scores $\alpha_j$ and $\beta_j$ [4]. AdaBoost is an extremely simple algorithm which is nevertheless
a very effective feature selection mechanism and an efficient learning algorithm. 
AdaBoost proceeds in rounds; in each round the “best” new feature is selected 
and added to the classifier. Typically AdaBoost is run for a few hundred rounds 
to yield a classifier which depends on a few hundred features. 
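
The text does not spell out the boosting update. One formulation that yields exactly two votes per binary feature is the confidence-rated variant of AdaBoost due to Schapire and Singer; the sketch below assumes that variant, and the function names, the smoothing constant `eps`, and the toy data are illustrative only.

```python
import math

def boost(X, y, rounds, eps=1e-6):
    """X[i][j] is the boolean value of feature j on example i; y[i] is +1 or -1.

    Returns a list of (feature index, alpha, beta) triples, one per round.
    """
    n, m = len(X), len(X[0])
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for j in range(m):
            # Weighted mass of positives and negatives where the feature is true/false.
            mass = {(v, s): eps for v in (True, False) for s in (+1, -1)}
            for xi, yi, wi in zip(X, y, weights):
                mass[(xi[j], yi)] += wi
            # Selection criterion: a smaller Z means a purer split.
            z = 2 * sum(math.sqrt(mass[(v, +1)] * mass[(v, -1)]) for v in (True, False))
            if best is None or z < best[0]:
                alpha = 0.5 * math.log(mass[(True, +1)] / mass[(True, -1)])
                beta = 0.5 * math.log(mass[(False, +1)] / mass[(False, -1)])
                best = (z, j, alpha, beta)
        _, j, alpha, beta = best
        model.append((j, alpha, beta))
        # Re-weight: examples that the new feature votes against gain weight.
        weights = [wi * math.exp(-yi * (alpha if xi[j] else beta))
                   for xi, yi, wi in zip(X, y, weights)]
        total = sum(weights)
        weights = [wi / total for wi in weights]
    return model

def score(x, model):
    """Sum of the selected feature votes for one example."""
    return sum(alpha if x[j] else beta for j, alpha, beta in model)

# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [[True, False], [True, True], [False, True], [False, False]]
y = [+1, +1, -1, -1]
model = boost(X, y, rounds=2)
print(model[0][0])                                   # index of the first selected feature
print([score(x, model) > 0 for x in X])              # predicted sign per example
```

On the toy data the informative feature is selected in the first round, and the sign of the summed votes agrees with every label.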

To summarize, the overall framework is to first generate a very large set 
of features which are related to the classification process, and then second to 
use AdaBoost to select a small set of these features so that the final classifier 
is effective and computationally efficient. This basic framework was introduced 
by Tieu and Viola in their work on image database retrieval and then used by 
Viola and Jones to produce an extremely fast and effective face detection system 
[5,6]. 

4.1 Training the Classifier 

The word relevance classifier is trained using a set of labeled data: a set of FAXes 
upon which OCR has been run, and in which the recipient information has been 
highlighted. In our experiments we use the ScanSoft SDK [7]. Each word on the 
FAX is assigned a label +1 if it contains information relevant to the recipient
identity, and -1 otherwise.

For each FAX in the training set the location of the recipient info is noted 
by drawing a bounding rectangle, or set of rectangles on the FAX. The rectangle 
is intended to “highlight” the relevant recipient information. During training,
the particulars of the rectangle are discarded. The only information retained is 
the identity of the words that lie within the rectangle. In practice, there is no 
attempt to label all relevant words. On each FAX only the most relevant words 
are highlighted (usually the contents of the “TO” field). Given that there may 
be over one hundred words on a FAX page, most of the words are not relevant. 



Table 1. Table showing the top scoring features selected by the AdaBoost process. 



Score   Feature
 5.92   Word in current bounding box is in “recipient alias”
 4.45   Word in current bbox is a human name
 3.59   Word “To” appears on the left (up to 500 pixels)
 2.93   Word in current bounding box is “COMPANYNAME” (binary)
 2.72   Word “Attn” appears on the left (up to 500 pixels)
 2.37   The string appears in the current word
 2.31   Word in current bounding box is “Business” (binary)
 1.68   ’Confidence of the word in current bounding box’ < 36
 1.61   “Mr” appears on the left (up to 500 pixels)

-5.41   1601 < Y co-ordinate of the center of the bounding box
-4.42   The word “Phone” appears on the left (up to 500 pixels)
-3.40   The word “From” appears on the left (up to 500 pixels)
-3.39   1421 < Y co-ordinate of the center of the bbox < 1600
-3.32   Distance of 4-th nearest word < 72
-2.52   1537 < X co-ordinate of the center of the bounding box
-2.20   496 < Width of the bounding box
-1.90   1157 < Y co-ordinate of the center of the bbox < 1420
-1.86   877 < Confidence of the word in current bounding box



The word scoring algorithm is trained to predict which words appear in the 
recipient rectangle. The learning problem is therefore binary: each word is given
a label of +1 if it is in the rectangle, and -1 otherwise. As described above, 
AdaBoost selecting from a set of word features is used to predict this label. 
After training on a set of 2221 FAXes, the correct label is predicted 95% of the 
time on a separate set of test data. The top features selected for the classifier 
are shown in Table 1. AdaBoost assigns two scores, or votes, to each selected 
feature: $\alpha$ if the feature is true and $\beta$ if false. Since each feature is either true or
false, it is useful to consider the net score, which is $\alpha - \beta$. We sort the features
by these net scores. 

Positive net scores are associated with relevant words, and negative with 
irrelevant words. Not surprisingly the most important feature tests if the word 
is in the recipient alias database. Another expected positive scoring feature tests 
for the presence of the word “TO” nearby and to the left. The negative features 
are equally interesting, but less predictable. The most negative states that it is 
bad to be near the bottom of the page (1601 dots is 8 inches, assuming 200 DPI).
Other negative features penalize words for being near other labels, like
“PHONE” and “FROM”. The feature Distance of 4th nearest word < 72 
states that if there are 4 words within 0.36 inches, then the word is less likely to 
be relevant. Essentially, words in the middle of a paragraph are less likely to be 
relevant. Note that our training set included FAXes with handwritten addresses 
(about 40%). Often the recipient address was the only handwritten word on the 
page. This led the system to conclude that words with a low OCR confidence 
(as handwritten words often are) were more likely to be the address. 

The signed scores assigned by AdaBoost are not directly interpretable as a 
probability. It is not difficult, however, to estimate a probability as a logistic
function of the score. The parameters of this logistic are estimated using logistic 
regression on a held out set of labeled examples (held out from the training 
set). The resulting quantity, which ranges from 0 to 1, is approximately the 
probability that the word is relevant to the recipient identity. 
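
The logistic relationship can be checked against the values reported in Figures 2 and 3. The sketch below does not perform the logistic regression described above; as an illustration only, it solves for the two parameters from the first two rows of Fig. 2 and then compares the resulting curve with other reported rows. The fitted parameter values are therefore inferred from the figures, not stated in the paper.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def fit_logistic_from_two_points(s1, p1, s2, p2):
    """Solve p = 1 / (1 + exp(-(a*s + b))) for (a, b) from two (score, probability) pairs."""
    a = (logit(p1) - logit(p2)) / (s1 - s2)
    b = logit(p1) - a * s1
    return a, b

def probability(score, a, b):
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Rows 1 and 2 of Fig. 2 determine the two parameters ...
a, b = fit_logistic_from_two_points(3.6206, 0.632152, 0.4270, 0.116035)

# ... and further rows of Figs. 2 and 3 serve as a consistency check.
checks = [(-0.8704, 0.044132), (-0.9962, 0.040053), (5.16044, 0.855890), (4.82982, 0.819846)]
for s, reported in checks:
    print(f"score {s:8.4f}: predicted {probability(s, a, b):.6f}, reported {reported:.6f}")
```

The predicted and reported probabilities agree closely, which suggests that a single logistic curve was used for both example FAXes.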

The final classifier yields very good coverage of the FAXes: since a recipient 
name typically contains two words, the chance of missing both words is very 
low. There are typically 1 or 2 false positives, and rarely more than 5 to 10,
on a given FAX. The set of relevant words is then matched to
the database of alias words as described below. The top scoring words for two 
typical FAXes are shown in Figures 2 and 3. 



5 Recipient Information Matching 



In order to robustly match in the presence of OCR errors, we have chosen to 
measure the “string edit distance” between words in the alias database and words 
in the FAXes. The string edit distance between two strings counts the character
insertions, deletions, and substitutions needed to transform one string into the
other. For example, the string edit distance between “CAATE” and
“CAR” is two deletions and one substitution. Based on observation of typical 
OCR errors, separate costs can be assigned to deletion, addition, and substitution 
errors. 

For use in the FAX routing process, the string edit distance is converted 
to a match score which ranges from zero to one. We consider this score to be 
analogous to a probability of the corrupted word given the true word. While we 
have considered estimating this probability function from data, in this paper we 
use a surrogate function which yields good results. We define the match score as 



$$m(w_1, w_2) = \exp\left(-\frac{\mathrm{dist}(w_1, w_2)}{\mathrm{maxdist}(w_1, w_2)}\right) \qquad (2)$$



where $\mathrm{dist}()$ is the string edit distance and $\mathrm{maxdist}()$ is the maximum string
edit distance between two strings of this length. Given equal costs for inser- 
tion/deletion/substitution the maximum distance is the length of the longer 
string. 
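
A minimal sketch of the weighted edit distance and the match score follows. It assumes that the exponent in Equation 2 is negative, so that identical strings score 1 and the score decreases with distance; the function names and the default unit costs are also assumptions.

```python
import math

def edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Weighted string edit distance by dynamic programming, O(len(a) * len(b))."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + dele,                            # delete ca from a
                curr[j - 1] + ins,                         # insert cb into a
                prev[j - 1] + (0.0 if ca == cb else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]

def match_score(w1, w2):
    """exp(-dist / maxdist), with maxdist the length of the longer string (unit costs)."""
    maxdist = max(len(w1), len(w2))
    if maxdist == 0:
        return 1.0
    return math.exp(-edit_distance(w1, w2) / maxdist)

print(edit_distance("CAATE", "CAR"))  # two deletions and one substitution
print(match_score("Smuth", "Smith"))  # one substitution out of five characters
```

Separate insertion, deletion, and substitution costs can be passed to `edit_distance` to reflect observed OCR error rates; `maxdist` would then need to be adjusted accordingly.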
Rank  Word         Score    Probability
1     JRINKER       3.6206  0.632152
2     MICROSOFT     0.4270  0.116035
3     GEICO        -0.8704  0.044132
4     Page         -0.9962  0.040053
5     14257087329  -1.3000  0.031635



Fig. 2. A simple example FAX which shows the top 5 scoring words. Each word is 
emphasized by enclosing it in a rectangle (not part of the original image); the word
rank is shown as a circled number.



String edit distance has a well known dynamic programming solution, which 
leads to an algorithm with $O(NM)$ complexity, where N is the number of char-
acters in the OCR word and M is the number of characters in the alias word [8].
Given K words in the alias database the complexity is naively $O(KNM)$ to find
the best matching word, a cost which may be prohibitive for large databases. 
While there is an extremely efficient algorithm for finding the exact match be- 
tween a word and a large database, a solution for the task of finding the best 
string under the string edit distance is not quite as clear. 

Rank  Word               Score    Probability
1     Paul               5.16044  0.855890
2     Ecompanystore.com  4.82982  0.819846
3     Viola              1.55514  0.245645
4     P                  0.99124  0.171346
5     hear               0.94523  0.166149

Fig. 3. A second more complex FAX which shows the top 5 scoring words.

Branch-and-Bound Search. We choose a branch-and-bound search to find
the best match, which relies on a cheap underestimate of the string edit distance
which we call the “order invariant edit distance”. One guaranteed underestimate
of the string edit distance is to ignore the component of the distance that depends
on character order. It is computed in the following way. For each word compute
a character occurrence vector. The word “CAATE” has two occurrences of A,
one of C, one of E and one of T. The word “CAR” has one A, one C, and one
R. The signed difference between these occurrence vectors is precisely related to
the order invariant edit distance: one A and one E must be deleted, and the T
must be substituted for an R. This distance is computed in $O(N + M)$ time by first
measuring the occurrences, and then subtracting the counts for each character.
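
The order invariant edit distance can be sketched with character counts. The sketch assumes unit costs for insertion, deletion, and substitution; the function name is illustrative.

```python
from collections import Counter

def order_invariant_distance(a, b):
    """Lower bound on the unit-cost edit distance that ignores character order."""
    ca, cb = Counter(a), Counter(b)
    surplus = sum((ca - cb).values())  # characters a has in excess of b
    deficit = sum((cb - ca).values())  # characters b has in excess of a
    # Each substitution cancels one surplus and one deficit character;
    # whatever is left over needs a deletion or an insertion.
    return max(surplus, deficit)

print(order_invariant_distance("CAATE", "CAR"))  # delete A and E, substitute T by R
print(order_invariant_distance("ABC", "CBA"))    # same characters, different order
```

For “CAATE” and “CAR” the bound equals the true edit distance of 3. For “ABC” and “CBA” the bound is 0 although the true distance is 2, which shows that it is an underestimate and not the distance itself.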

The branch-and-bound search algorithm is used as follows. Given a FAX 
word, compute the distance underestimate for each database word, and sort the 
database elements from least distance to greatest. Starting with the smallest 
distance database words, compute the true string edit distance for each and 
reinsert the word in the sorted list based on the true string edit distance (which 
is always greater than or equal to the underestimate). The first example which 
is encountered twice (first using the underestimate and then later using the true 
distance) is the closest example in the database. This is because the true distance 
for this example is less than the distance underestimate for those examples for 
which the true distance is not yet known, and less than the true distance for 
those examples for which the true distance is known. 
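
The search can be sketched with a priority queue in place of the sorted list with reinsertion; the two are equivalent for this purpose. Unit edit costs, the function names, and the toy database are assumptions of the sketch.

```python
import heapq
from collections import Counter

def lower_bound(a, b):
    """Order invariant underestimate of the unit-cost edit distance."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))

def edit_distance(a, b):
    """Unit-cost string edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def best_match(fax_word, database):
    """Return (closest word, its distance, number of exact distance evaluations)."""
    EXACT, ESTIMATE = 0, 1  # at equal distance, exact entries are popped first
    heap = [(lower_bound(fax_word, w), ESTIMATE, w) for w in database]
    heapq.heapify(heap)
    evaluations = 0
    while heap:
        dist, kind, word = heapq.heappop(heap)
        if kind == EXACT:
            # Seen for the second time: no remaining entry can be closer.
            return word, dist, evaluations
        evaluations += 1
        heapq.heappush(heap, (edit_distance(fax_word, word), EXACT, word))
    return None, None, evaluations

database = ["SMITH", "SMYTHE", "JONES", "RILEY", "DEAN", "VIOLA"]
print(best_match("SMUTH", database))
```

On this toy database the closest word is found after a single exact distance computation, because the underestimate for “SMITH” is already tight.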
The complexity of the matching process is no longer deterministic, since it
depends on the words in the database. In practice the cost of the match of our
database and other typical databases is approximately $O(K(N + M))$, since the
underestimates are fairly tight. 

6 Recipient Information Integration 

We have described a scheme for “spotting” the relevant words on a FAX, and 
a mechanism for matching strings efficiently. Using these modules a set of N 
relevant words can be extracted from the FAX. Each of these words can be 
matched efficiently to the database, yielding a set of alias words which match 
“well”. The remaining issue is integration of evidence from these various sources.
For this problem we propose two algorithms. 

The first algorithm is called “simple weighted score”. Each email address in
the database is assigned a score using the following formula:

$$s(a) = \sum_{w} r(w)\, m(w, a) \qquad (3)$$

where s(a) is the score for the alias, the summation is over all words in the 
document, r(w) is the relevance of the word, and m(w, a) is the best match 
between the FAX word and one of the text fields in the alias entry (first name, 
last name, email address, or full name). A simple example: if the string “Bob 
Smith” appears in the FAX, the evidence for the alias “bob_smith” is good 
because the word “Bob” votes for the first name of the user bob_smith, while 
the word “Smith” votes for the last name. Note that the word “Bob” also votes 
for the alias “bob_riley” and “bob_dean”.
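
Equation 3 can be sketched as follows. The relevance values, the alias entries, the case-insensitive comparison, and the negative exponent in the match score are assumptions made for the illustration.

```python
import math

def edit_distance(a, b):
    """Unit-cost string edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def match(word, field):
    """Match score in (0, 1]; case is ignored (an assumption)."""
    word, field = word.lower(), field.lower()
    longest = max(len(word), len(field), 1)
    return math.exp(-edit_distance(word, field) / longest)

def simple_weighted_score(fax_words, alias_fields):
    """s(a): sum over FAX words of r(w) times the best match against the alias fields."""
    return sum(r * max(match(w, f) for f in alias_fields) for w, r in fax_words)

# Hypothetical alias entries: first name, last name, email alias, full name.
aliases = {
    "bob_smith": ["Bob", "Smith", "bob_smith", "Bob Smith"],
    "bob_riley": ["Bob", "Riley", "bob_riley", "Bob Riley"],
}
# Hypothetical (word, relevance) pairs produced by the word spotter.
fax_words = [("Bob", 0.9), ("Smith", 0.8), ("Page", 0.04)]

scores = {alias: simple_weighted_score(fax_words, fields)
          for alias, fields in aliases.items()}
for alias, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{alias}: {s:.3f}")
```

Both aliases receive the full vote of “Bob”, but only bob_smith receives the full vote of “Smith”, so it ranks first.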

The greatest flaw in the simple weighted score is its assumption of word 
independence. The score for the alias bob_smith is the same if “Bob Smith” is 
observed or alternatively if “Joe Smith” and “Bob Jones” appear separately. 

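The second proposed matching algorithm is called “contiguous weighted
score”. It attempts to model the interdependence of words in a FAX document
as follows:

$$s(a) = \sum_{w_t, w_{t+1}} C\big(r(w_t), r(w_{t+1})\big)\, m(a, w_t, w_{t+1}) + \sum_{w} r(w)\, m(a, w) \qquad (4)$$

where $w_t, w_{t+1}$ are contiguous words in the FAX, $C()$ is a function which
combines the confidences of the two words (e.g. max, sum, or product), and

$$m(a, w_t, w_{t+1}) = \max \left\{ \begin{array}{l}
m(\text{‘first last’}, \text{“}w_t\ w_{t+1}\text{”}) \\
m(\text{‘last first’}, \text{“}w_t\ w_{t+1}\text{”}) \\
m(\text{full\_name}, \text{“}w_t\ w_{t+1}\text{”}) \\
m(\text{first}, \text{“}w_t\ w_{t+1}\text{”}) \\
m(\text{last}, \text{“}w_t\ w_{t+1}\text{”})
\end{array} \right. \qquad (5)$$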

Fig. 4. A screen dump of the final system operating on the two FAXes shown above.
In this case the top three recipient addresses are shown.

Returning to the above example, the FAX string “Bob Smith” now yields a
higher score for bob_smith, since there is an additional bonus for matching a
contiguous pair of words. In addition, the last two terms match a pair of FAX
words to a single alias entry. This helps in situations where the word boundaries
found by OCR are not reliable.
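
Equations 4 and 5 can be sketched in the same style. The alias field names, the choice of the product as the combining function, the relevance values, and the negative exponent in the match score are assumptions of the sketch.

```python
import math

def edit_distance(a, b):
    """Unit-cost string edit distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def match(text, field):
    """Match score in (0, 1]; case is ignored (an assumption)."""
    text, field = text.lower(), field.lower()
    longest = max(len(text), len(field), 1)
    return math.exp(-edit_distance(text, field) / longest)

def contiguous_weighted_score(fax_words, alias, combine=lambda r1, r2: r1 * r2):
    """fax_words: (text, relevance) pairs in reading order.
    alias: dict with hypothetical keys 'first', 'last', 'full_name', 'email'."""
    first, last, full = alias["first"], alias["last"], alias["full_name"]
    pair_targets = [f"{first} {last}", f"{last} {first}", full, first, last]
    single_targets = [first, last, full, alias["email"]]
    total = 0.0
    # First sum of Equation 4: contiguous word pairs, matched as in Equation 5.
    for (w1, r1), (w2, r2) in zip(fax_words, fax_words[1:]):
        pair = f"{w1} {w2}"
        total += combine(r1, r2) * max(match(pair, t) for t in pair_targets)
    # Second sum of Equation 4: single words, as in the simple weighted score.
    for w, r in fax_words:
        total += r * max(match(w, t) for t in single_targets)
    return total

bob_smith = {"first": "Bob", "last": "Smith", "full_name": "Bob Smith", "email": "bob_smith"}
bob_riley = {"first": "Bob", "last": "Riley", "full_name": "Bob Riley", "email": "bob_riley"}
fax_words = [("Bob", 0.9), ("Smith", 0.8)]

print(contiguous_weighted_score(fax_words, bob_smith))  # 0.9*0.8 pair bonus + 0.9 + 0.8
print(contiguous_weighted_score(fax_words, bob_riley))
```

With the product as the combining function, a pair contributes a large bonus only when both of its words are individually relevant; max or sum would weight pairs containing one irrelevant word more heavily.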

Email aliases that appear directly in a FAX message may require special 
handling, since the domain may or may not appear (e.g. “pviola” and “pvi- 
ola@microsoft.com” are equivalent). This is easily handled by adding the alias 
to the database of alias strings both with and without the appended domain for 
the receiving organization. 

7 Experimental Results 

Final routing experiments were performed on a set of 2455 business FAXes re- 
ceived at one company over a period of several months (see Figure 4 for a screen 
dump). The FAXes varied in type, including personal FAXes intended for em- 
ployees, purchase orders, and forms filled out by vendors and clients. There were 
a total of 15,000 addresses in the email database. The distribution followed a 
predictable Zipf law, in which a large percentage of FAXes were sent to a few 
of the addresses. In this set of FAXes, there were a total of 723 email addresses 
(though this was not assumed during testing). 

The set of FAXes was randomly separated into two sets of 2221 for “training”
and 234 for “testing”. The set used for testing was randomly selected from those
which were not addressed by hand. On the testing set 95% of the FAXes were
routed to the correct recipient. In this experiment the match score of Equation 4
is used. The relevance function is computed using the score assigned by the 
boosted classifier after 300 rounds of boosting. 
8 Conclusions

The main contribution of this paper is to present a single unified machine learn-
ing process which can be used to estimate the relevance of each word on a FAX
page. The learning algorithm removes most of the detailed engineering and hand
tuning required of the system builder, since it optimally combines word text fea- 
tures, word relationship features, and global features such as the location on the 
page. As a result this system is more likely to be applicable to related problems, 
such as the extraction of information from other types of scanned documents. 
Abstract. This paper introduces a novel approach for dynamic structuring of
contextual lattices. It is anticipated that the approach can be applied to improve
the accuracy of word-segmentation patterns in autonomous text recognition sys- 
tems. A multi-level hierarchical structure of lattices is used to implement the 
algorithm, and the approach can be applied in a generic manner to other pattern 
recognition problems. We apply a top-down structural model in parallel with a 
constrained probabilistic model and intelligent distributed searching paradigm. 
This paradigm is based on the integration between probabilistic bi-grams and 
adaptive intelligent swarm-based agent search to identify the most likely sen- 
tence structures. The searching paradigm allows the exploitation of positive 
feedback as a search mechanism and, consequently, makes the model amenable 
to parallel implementation. The distributed intelligence of the proposed ap- 
proach enables the dynamic structuring of contextual lattices and has proved to 
scale well with large lattice sizes. Moreover, we believe that the proposed archi-
tecture solves the ill-conditioned nature of most pattern recognition problems
that lies in the effect of noise in the segmentation phase. To verify the devel- 
oped Swarm-based Intelligent Search Algorithm (SISA), a simulation study was 
conducted on a set of variable size scripts. The proposed paradigm proved to be 
efficient in identifying the most highly segmented patterns and also returned 
good decisions concerning lower probability segments enabling further re- 
segmentations and re-combinations to take place. The paper is the first to apply 
the intelligent swarm-based paradigm for the identification of optimal seg- 
mented patterns in contextual recognition models. The algorithm is compared 
with other algorithms for the same problem, and the computational results dem- 
onstrate that the proposed approach is very efficient and robust for large-scale 
statistical contextual-lattice structures. 



1 Introduction

Sentence segmentation has become an integral part of intelligent text recognition and 
the optimization of large vocabulary models. However, sentence segmentation may 
include false segments (or sub-words) as a result of the misidentification of segmenta- 
tion points. The existence of these false words leads to sub-optimal or inaccurate 
solutions, and severely reduces the recognition rate. Therefore effective identification 
and reconstruction of false paths (or sentence words) corresponding to false sub seg-
ments has the potential to improve the segmentation accuracy. Extensive research has
been carried out on the problem of identifying false paths that arise in some applica- 
tions, such as time analysis and circuit optimization in large digital IC designs. Al- 
though the complete identification of all such false paths in a circuit or network is an 
NP-complete problem, a number of heuristic or approximate methods have been pro-
posed [1,2]. The search process is the most important and challenging part of our
optimization strategy. The likelihood of different segmentation patterns is computed 
using scores on the feature model. The search paradigm has then to efficiently choose 
patterns with the highest likelihood. The number of possible hypotheses grows expo- 
nentially with the number of features and imposes heavy constraints on the computa- 
tion and storage requirements. Therefore intuitively obvious techniques such as an 
exhaustive search are not at all practical and strategies that save on computation by 
modifying the search space are vital to achieve efficient and accurate performance. 

In character recognition, dictionary or n-gram constraints [3] restrict the allowable
combinations of characters in forming words, while grammatical constraints limit the
assembly of words into sentences. Lack of success in some pattern recognition problems
can be a result of the problem being ill-conditioned [4]. An example of the
ill-conditioned nature of pattern recognition lies in the segmentation stage. The
noise removal stage may, for example, apply smoothing, which reduces noise but may
cause adjacent symbols to touch one another. Where white space is used to segment
putative characters, this small change will produce an incorrect glyph that will fail
to be recognized. In practice no segmentation method is sufficiently reliable for
handwritten document recognition, and some symbols will be merged while others are
broken. These errors are difficult to recover from in a pipeline pattern recognition
architecture. If many putative segmentations are passed forward to increase the
likelihood that the correct one is amongst them, then a combinatorial explosion of
possibilities is likely to render the solution space intractable.

In this paper, we introduce a new architecture for contextual modeling using
multi-layer lattices in parallel with an intelligent distributed searching paradigm
for the optimization of the contextual model. A new adaptive, intelligent swarm-based
agent approach is proposed for the optimization of this lattice structure, in which
good suggestions can be fed back to the segmentation layer to adjust the segmentation
points. The constructed multi-layered lattice is dynamically stretched and/or shrunk
according to improvements in the segmentation process until a better structure is
obtained. The proposed approach can be applied to find the best sub-lattice structure
of L-best sentences from segmented entities formed by low-level processing. The
proposed iterative, intelligent swarm-based agent model is the first to be applied to
this class of problems. From experimental results we believe that our proposed
approach holds promise in improving the segmentation process and is robust and
dynamically adaptable to changes in the contextual model. The paper is organized as
follows. In the next section, we introduce the contextual model based on a multi-layer
lattice and illustrate it by describing the representation of the segmented words
using an attributed word-lattice graph. Next, in section 3, we introduce the proposed
swarm-based intelligent searching paradigm. The dynamic re-structuring of contextual
lattices is explained in section 4. Experiments and simulation results are presented
in section 5 with an analysis of the performance of the proposed algorithm. Finally, a
summary and conclusions are given in section 6.

Contextual Modeling Using a Multi-layer Lattice 

In order to explain the model, consider a simple example consisting of a tiny subset
of possible English sentences: (1) The cat sat on the mat, (2) The boy threw the ball.
In this domain we have only eleven word tokens, four of which are the word "the". The
a priori probability of "the" is therefore 4/11 (about 0.36), while that of each other
word is 1/11 (about 0.09). A transition matrix can also be derived with the
probability of any pair of words appearing contiguously. We introduce special tokens
to mean before the start of, and after the end of, a sentence, and include these in
our bi-gram probabilities. These tokens might be termed top ($\top$) and bottom
($\bot$) in the context of forming a lattice as shown in figure 1. In general a
lattice implements a partial ordering based on the $\le$ relationship. Our lattice
refers to the ordering of words in a sentence and is in fact a total ordering (that
is, on $<$), as it is not possible for two words to share the same position in a
sentence.
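The bi-gram model with boundary tokens can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names `<s>` and `</s>` stand in for the top and bottom tokens, words are lower-cased before counting, and the function name `bigram_model` is hypothetical.

```python
from collections import Counter

START, END = "<s>", "</s>"  # assumed stand-ins for the top and bottom tokens


def bigram_model(sentences):
    """Return (a priori word probabilities, bi-gram transition probabilities)."""
    unigrams, bigrams, context = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()  # assumption: case is ignored
        unigrams.update(words)
        padded = [START] + words + [END]
        for prev, nxt in zip(padded, padded[1:]):
            bigrams[(prev, nxt)] += 1
            context[prev] += 1
    total = sum(unigrams.values())
    prior = {w: c / total for w, c in unigrams.items()}
    # P(next | prev) = count(prev, next) / count(prev as a left context)
    transition = {pair: c / context[pair[0]] for pair, c in bigrams.items()}
    return prior, transition


prior, transition = bigram_model(
    ["The cat sat on the mat", "The boy threw the ball"]
)
```

With these two sentences every sentence starts with "the", so the transition from `<s>` to "the" has probability 1, and each of the four words that follow "the" receives a transition probability of 0.25.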

A similar lattice may be constructed for letters, as shown in figure 2. Consider an
imperfect segmentation process that takes an image and divides it into a sequence of
glyphs which, it is hoped, correspond to characters, and which also identifies word
boundaries (A). Each glyph is processed by a classifier, which returns a set of
character classes, each with an associated probability. Every sequence of characters
between word boundaries is applied to a character lattice of the type shown in figure
2. This prunes character combinations that could not form words in the model. If no
valid word can be formed, the system backtracks to the segmentation stage. Where words
can be formed that obey the lattice constraints, they are assigned a probability
according to the recognition, letter and letter bi-gram probabilities. That is,

$$P \propto \prod_{i=1}^{n} r_i \, p_i \prod_{i=1}^{n-1} b_{i,i+1}$$

where $r_i$ is the recognition probability of the letter in position $i$, $p_i$ is the
a priori probability of that letter, and $b_{i,i+1}$ is the bi-gram probability of the
letter in position $i$ and the following letter. This expression is normalized to
remove the effects of word length and scaled to a convenient magnitude. The system
backtracks to the segmentation stage, and the process is repeated for the affected
words. This process will usually result in groups of words that have a probability of
being correct. These constitute our top-down cue, and we hypothesise words based on
the most likely words in the model. This process will suggest segmentation and
classification mistakes that can be corrected, and the process is repeated to try to
establish further results that are consistent with the sequences already found.
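A minimal sketch of such a word score follows. The text does not say how the normalization for word length is carried out, so the geometric mean used here is an assumption, as are the function name `word_score` and its argument layout.

```python
import math


def word_score(recognition, prior, bigram):
    """Length-normalized score of one candidate word.

    recognition[i] : classifier probability of the letter at position i
    prior[i]       : a priori probability of that letter
    bigram[i]      : bi-gram probability of letters i and i + 1 (length n - 1)
    """
    n = len(recognition)
    if n == 0 or len(prior) != n or len(bigram) != n - 1:
        raise ValueError("inconsistent input lengths")
    factors = [r * p for r, p in zip(recognition, prior)] + list(bigram)
    if any(f <= 0.0 for f in factors):
        return 0.0  # an impossible letter or transition prunes the word
    # Assumed normalization: geometric mean of all factors, computed in log
    # space so that long words do not underflow.
    log_sum = sum(math.log(f) for f in factors)
    return math.exp(log_sum / len(factors))
```

Working in log space keeps the product stable for long words, and dividing by the number of factors makes scores of words with different lengths comparable.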



Representation of the Segmented Words Using Attributed Word-Lattice Graph

Word strings (or sub-segments) that result from different segmentation likelihood
patterns are represented as a bi-gram attributed word-lattice graph
$L_G(\top, S, T, A, \bot)$, where:

$\top$: token meaning before the start of a sentence.

S: the set of all states in the lattice, where a state represents a word (or
sub-segment) string, |S| = N.

T: the set of bi-gram transition probabilities between each pair of words,
$T = \{T_{ij} : 1 \le i, j \le N\}$, where $T_{ij}$ is the bi-gram probability of
observing a word j given that it is preceded by the word i.

A: the set of state probability pairs, $A = \{a_1, a_2, a_3, \ldots, a_N\}$, with
$a_j = (\varphi(u_j), \vartheta(u_j))$, where $\varphi(u)$ is the recognition
probability and $\vartheta(u)$ is the a priori probability of that word.

$\bot$: token meaning after the end of a sentence.
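One possible in-memory form of this graph is sketched below. It is not the authors' implementation; the class name `WordLattice`, the boundary token names and the method names are all hypothetical, and states are keyed by an identifier so that the same word can occur at several positions.

```python
from dataclasses import dataclass, field

TOP, BOTTOM = "<top>", "<bottom>"  # assumed names for the boundary tokens


@dataclass
class WordLattice:
    # state id -> (recognition probability, a priori probability)
    states: dict = field(default_factory=dict)
    # (state i, state j) -> bi-gram probability of j given preceding i
    transitions: dict = field(default_factory=dict)

    def add_state(self, state, recognition, prior):
        self.states[state] = (recognition, prior)

    def add_transition(self, i, j, probability):
        for node in (i, j):
            if node not in self.states and node not in (TOP, BOTTOM):
                raise KeyError(f"unknown state: {node}")
        self.transitions[(i, j)] = probability

    def successors(self, i):
        """States reachable from i by one transition, in insertion order."""
        return [j for (a, j) in self.transitions if a == i]

    def node_probability(self, state):
        """Product of recognition and a priori probability of a state."""
        recognition, prior = self.states[state]
        return recognition * prior
```

Storing transitions as a dictionary keyed by state pairs keeps sparse lattices small; a dense matrix would be an equally valid choice for a fully connected graph.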




Fig. 3. An example of a fully-connected Word-lattice graph of 14 words 



Swarm-Based Intelligent Search for the Optimization of Contextual Lattices (SISOCL)

Overview of Swarm Intelligence (Ant Colony Systems ACS) 

Swarm intelligence originates from work on the emergence of collective behaviors in
real ants [5-7]. The Ant Colony System (ACS) is a particular heuristic within ant
colony optimization (ACO), one of the nature-inspired meta-heuristics for the solution
of discrete optimization problems. The first such algorithm, termed the ant system
(AS), was introduced by Dorigo [8,9]. It is the result of research on computational
intelligence approaches to combinatorial optimization problems. By laying down trails
of different density of an attracting substance called a "pheromone", ants become able
to discriminate between routes and between food sources of different quality. Sensing
of pheromone trails serves as a mechanism for indirect communication among individuals
regarding paths, and is used to make routing decisions. In ant colony-based
algorithms, a set of artificial agents moves on the graph that represents the instance
of the problem. While moving, they build solutions and modify the problem
representation by adding collected information. This process continues until almost
every agent chooses the same path. Studies have shown that agents learn about the
state of the environment continuously and very effectively, and can search for
specific items of information in a large search space. In addition, the artificial
ants are equipped with a local heuristic function to guide their search through the
set of feasible solutions.



The Proposed Swarm Agent Model 

The objective of our searching paradigm is to find the top L-best sentences in the
contextual segmentation model and return good decisions for dynamic re-structuring of
the contextual word-lattice. The main concept of the algorithm is to generate
populations of swarm agents able to navigate the search space in a distributed manner
and intelligently find the best sub-lattice structure representing the top L-best
sentences. Two types of swarm-based agents are proposed in our model: the Forward
agent (F-agent) and the Backward agent (B-agent). The F-agent is modeled as a small
moving object that carries a small stack memory, explores the search space using local
knowledge of the neighbouring nodes, and can apply a local updating process to the
environment. At any time step, the F-agent at node i has to choose a neighbouring node
j to move to. It samples a random number q. If $q \le q_0$, the best forward word-node
is chosen (exploitation) according to equation (1); otherwise a neighbouring word-node
is chosen according to equation (2) (biased exploration):



$$j = \begin{cases} \arg\max_{u \in S_k(i)} \left\{ [\tau(i,u)] \cdot [\eta(i,u) \cdot \psi(u)]^{\beta} \right\} & \text{if } q \le q_0 \\ J & \text{otherwise} \end{cases} \qquad (1)$$

where:

$\tau(i,u)$ is the pheromone trail of edge (i,u),

$\eta(i,u)$ is the bi-gram probability from word node i to node u,

$S_k(i)$ is the set of nodes that remain to be visited by agent k positioned on node i
(to make the solution feasible),

$\psi(u)$ is the word-node probability, calculated as
$\psi(u) = \varphi(u) \cdot \vartheta(u)$, where $\varphi(u)$ is the probability of
recognition of candidate word u and $\vartheta(u)$ is the probability of frequent use
in the BNC dictionary,

$\beta$ is a parameter which determines the relative importance of pheromone versus
bi-gram cost ($\beta > 0$),

$q$ is a random number uniformly distributed in [0,1],

$q_0$ is a parameter ($0 < q_0 < 1$) which determines the relative importance of
exploitation versus exploration,

$J$ is a neighbouring word-node selected according to the probability distribution,
called a random-proportional rule, given in the following equation:

$$p_k(i,j) = \begin{cases} \dfrac{[\tau(i,j)] \cdot [\eta(i,j) \cdot \psi(j)]^{\beta}}{\sum_{u \in S_k(i)} [\tau(i,u)] \cdot [\eta(i,u) \cdot \psi(u)]^{\beta}} & \text{if } j \in S_k(i) \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
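The state transition rule of equations (1) and (2) can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `choose_next`, the dictionary layout and the default values of `beta` and `q0` are assumptions, as is the uniform fallback used when every candidate has zero weight.

```python
import random


def choose_next(i, candidates, tau, eta, psi, beta=2.0, q0=0.9, rng=random):
    """Pick the next word-node for an F-agent positioned on node i.

    tau[(i, u)] : pheromone on edge (i, u)
    eta[(i, u)] : bi-gram probability from i to u
    psi[u]      : word-node probability (recognition * a priori)
    """
    if not candidates:
        raise ValueError("no feasible node left")
    weights = [tau[(i, u)] * (eta[(i, u)] * psi[u]) ** beta for u in candidates]
    if rng.random() <= q0:
        # Exploitation: take the best edge, as in equation (1).
        best = max(range(len(candidates)), key=weights.__getitem__)
        return candidates[best]
    if sum(weights) == 0.0:
        return rng.choice(candidates)  # assumption: uniform pick as a fallback
    # Biased exploration: random-proportional rule, as in equation (2).
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A larger `q0` makes agents follow the currently best edge more often, while a smaller value lets more of them sample alternatives in proportion to their weight.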



When building the agents' tours, the choice of edges is guided by both heuristic
information and pheromone information. The state transition rule resulting from
equations (1) and (2) favours the choice of nodes (word segments) connected by words
with high transition probability (highly correlated words) and a greater amount of
pheromone. While constructing a tour, the swarm F-agent applies local updating: it
changes the pheromone level on its visited nodes (or edges) by applying the local
updating rule as follows:

$$\tau(i,j) = (1-\rho)\,\tau(i,j) + \rho\,\tau_0 \qquad (3)$$

where $\tau_0$ is the initial pheromone level and $0 < \rho < 1$ is the pheromone
evaporation parameter. The effect of the local updating rule is to make the
desirability of edges change dynamically in order to shuffle the tour. If ants explore
different paths, then there is a higher probability that one of them will find an
improving solution than if they all search in the narrow neighbourhood of the previous
best tour. Every time an ant constructs a path, the local updating rule makes the
pheromone on its visited edges diminish, so that they become less attractive. Hence,
the nodes in one ant's tour will be chosen with a lower probability when building
other ants' tours. As a consequence, ants will favor the exploration of edges not yet
visited, which prevents convergence to a common path.

A Backward swarm agent (B-agent) is modeled in the same way: a moving object that
carries a stack memory and is able to apply a global updating. The global updating
rule is performed after all swarm agents in the population have completed their tours.
In order to make the search more directed, global updating is intended to provide a
greater amount of pheromone to high-likelihood sentences and reinforce them.
Therefore, only the globally L-best swarm agents that found the L-best solutions (the
most highly probable sentences) up to the current iteration of the algorithm are
permitted to deposit pheromone.

The B-agent (corresponding to each of the L-best tours) traverses the same route as
its corresponding F-agent in the opposite direction, and the pheromone level is
modified according to the global updating rule, which is proportional to the fitness
of the corresponding tour. The fitness $\Gamma$ of a candidate solution $\psi$ is
estimated as follows:

$$\Gamma(\psi) = \gamma \prod_{(i,u) \in \psi} \eta(i,u)\,\varphi(u)\,\vartheta(u) \qquad (4)$$

where $\eta(i,u)$ is the bi-gram probability (transition) from node i to node u,
$\varphi(u)$ is the probability of recognition of word string u, and $\vartheta(u)$ is
the a priori probability of that word. $\gamma$ is a scaling coefficient. The solution
length is represented as the number of segmented words in the solution $\psi$. This
length depends on the location of segmentation points in the proposed sentence. The
main objective of the global updating rule is to increase pheromone on word-nodes
(edges) of the current L-best tours and decrease pheromone on other edges. The
pheromone level is modified according to the following equations:

$$\Delta\tau(i,j) = \begin{cases} \Gamma(\psi) & \text{if the tour belongs to the global L-best sentences} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

$$\tau(i,j) = (1-\alpha)\,\tau(i,j) + \alpha\,\Delta\tau(i,j) \qquad (6)$$

where $\Gamma(\psi)$ is the fitness of the corresponding sentence belonging to the
L-best global best sentences (tours) found up to the current iteration, and $\alpha$
($0 < \alpha < 1$) is the pheromone evaporation rate.
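The local and global pheromone updates can be sketched together as below. The sketch makes several assumptions that are not in the text: the function names, the default parameter values, the treatment of boundary tokens (they contribute only their bi-gram term), and the use of the largest fitness when an edge belongs to more than one L-best tour.

```python
def local_update(tau, edge, rho=0.1, tau0=0.01):
    """Equation (3): pull the pheromone on a visited edge toward tau0."""
    tau[edge] = (1.0 - rho) * tau[edge] + rho * tau0


def fitness(tour, eta, phi, theta, gamma=1.0):
    """Scaled product of bi-gram, recognition and a priori probabilities."""
    value = gamma
    for i, u in zip(tour, tour[1:]):
        # Assumption: nodes without an entry (boundary tokens) count as 1.
        value *= eta[(i, u)] * phi.get(u, 1.0) * theta.get(u, 1.0)
    return value


def global_update(tau, best_tours, eta, phi, theta, alpha=0.1):
    """Equations (5) and (6): reinforce L-best edges, evaporate the rest."""
    delta = {}
    for tour in best_tours:
        f = fitness(tour, eta, phi, theta)
        for edge in zip(tour, tour[1:]):
            # Assumption: keep the largest fitness if tours share an edge.
            delta[edge] = max(delta.get(edge, 0.0), f)
    for edge in tau:
        tau[edge] = (1.0 - alpha) * tau[edge] + alpha * delta.get(edge, 0.0)
```

Edges outside the L-best tours receive a zero deposit, so their pheromone simply decays by the factor (1 - alpha) at each global update.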



The Dynamic Re-structuring of Contextual Lattices 

The proposed paradigm returns good decisions concerning lower-probability segments,
enabling further re-segmentations and re-combinations to take place. The
re-segmentation request of the intelligent SISOCL layer can take three forms:

- Split(*p). This feedback request asks for a word string p to be split into two new
word segments. An example is shown in figure 4.

- Merge(*p1, *p2). This request asks for the generation of a new word corresponding
to the merging of two word strings p1 and p2 (figure 5).

- Adjust(*p1, *p2). This request asks for the generation of two new word-nodes
resulting from adjusting the segmentation point between two word strings p1 and p2.
An example is shown in figure 6.
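The three requests can be illustrated on a toy segmentation held as a list of character strings. In the system described here the operations act on image segments and word-nodes, so the string representation, the function names and the offset arguments below are assumptions made purely for illustration.

```python
def split(segments, k, at):
    """Split segment k into two new segments at character offset `at`."""
    s = segments[k]
    if not 0 < at < len(s):
        raise ValueError("split point must fall inside the segment")
    return segments[:k] + [s[:at], s[at:]] + segments[k + 1:]


def merge(segments, k):
    """Merge segment k with its right neighbour k + 1."""
    if not 0 <= k < len(segments) - 1:
        raise IndexError("segment has no right neighbour")
    return segments[:k] + [segments[k] + segments[k + 1]] + segments[k + 2:]


def adjust(segments, k, shift):
    """Move the boundary between segments k and k + 1 by `shift` characters.

    A positive shift moves characters from segment k + 1 into segment k.
    """
    if not 0 <= k < len(segments) - 1:
        raise IndexError("segment has no right neighbour")
    joined = segments[k] + segments[k + 1]
    cut = len(segments[k]) + shift
    if not 0 < cut < len(joined):
        raise ValueError("adjusted boundary must leave both segments non-empty")
    return segments[:k] + [joined[:cut], joined[cut:]] + segments[k + 2:]
```

For example, `split(["thecat", "sat"], 0, 3)` yields `["the", "cat", "sat"]`, and `adjust(["thec", "at"], 0, -1)` yields `["the", "cat"]`. Each function returns a new list, so the previous segmentation remains available if a request has to be undone.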




Fig. 4. (a) Intermediate node v, (b) the splitting of node v into v' and v''

Fig. 5. (a) Intermediate nodes e and f, (b) node x resulting from merging e and f

Fig. 6. (a) Intermediate nodes e and f, (b) nodes e' and f' resulting from adjusting
the segmentation point between e and f



Experiments and Results

In this section we present the results of our main experiments. We show that the
proposed iterative method of jointly optimizing a lexicon, segmentation, and language
model not only results in better sentence segmentation but also further reduces the
search space of the hypothesis model. The word-lattice models are trained using a
dataset of 6,318 words with more than 800 occurrences in the whole 100M-word British
National Corpus (BNC). Sets of electronic texts (e-texts) produced by Project
Gutenberg (one of the Internet's oldest producers of free electronic texts) are used
in the training process. In the first part of our work, we designed and produced a
toolkit, a set of C++ programs that facilitate textual document modeling work. Some of
the tools are used to process general textual data into word frequency lists and
vocabularies, word bi-gram models, and bi-gram-related statistics. For the purpose of
experimentation with the proposed searching paradigm we designed our own vocabulary
language model, and care has been taken to construct these test models so as to verify
the performance of the search engine accurately. For the purpose of scoring, we used
the a priori probabilities and the recognition probabilities of each segmented word
(word segment) and the bi-gram transition probabilities that represent correlation
between word segments [10]. Synthetic data is created by adding different hypotheses
to each state in the model. To quantify the improvement in the quality of the
solution, we used the probability factor $g^*$, defined as $g^* = (q - q')/q$, where q
is the probability of the best solution up to the time of consideration, and q' is the
probability of the 2nd best solution. In figures 7 and 8, we compare the performance
of our proposed algorithm with two randomized local search algorithms: Simulated
Annealing (SA) and a hybrid Two-Phase Optimization (2PO) [11].

Figure 7 shows the performance over time. Both our proposed algorithm and the 2PO
algorithm converge faster than SA. In figure 8, we study how well the algorithms scale
with the problem size. In order to simulate situations where there is a limited time
for optimization, each algorithm was terminated after a constant execution time of 120
seconds even if it had not converged. For larger problems, SA is ineffective since it
does not have enough time to freeze. Although 2PO does not converge either, it still
manages to find good solutions even for large problem sizes. In figure 9, we compare
the running time of the proposed SISOCL algorithm with the stack decoder A* algorithm
[12], the Viterbi algorithm [10,13], SA, and 2PO, for different lattice sizes.

As observed from the figure, the swarm-based algorithm (SISOCL) is very robust for
large graph sizes. Its running time grows at a smaller rate than that of the other
compared algorithms. On the other hand, the running time of the 2PO algorithm grows at
a smaller rate than that of SA. The ratio between the running time of SA and 2PO
increases with the number of relations joining the word states in a large lattice.
This can be explained by the fact that 2PO starts with a local minimum. The systematic
A* algorithm has a high run-time cost, which limits its applicability for large
problems. This can be explained by the fact that the A* algorithm performs a
best-first search of the lattice looking for the highest probability path, and hence
the highest probability sentence. So, as the number of states and relations increases,
the running time grows exponentially, rendering exhaustive optimization inapplicable.
Figure 10 shows the increase in the probability of the best and 2nd best solution
corresponding to dynamic changes in the word-lattice model relative to re-segmentation
processes. As shown in the chart, we obtained fast improvement. This means that the
proposed approach returns good decisions for the re-segmentation points and hence
achieves fast improvement in identifying the best segmentation pattern with maximum
likelihood.

Fig. 7. The change in the likelihood of the best solution versus simulation time for
the algorithms compared (swarm-based, SA and 2PO), for a word-lattice graph of size 35.

Fig. 8. Likelihood probability of the best solution versus problem size

Figure 11 illustrates the changes in the relative probability versus the number of
re-segmentations for different word-lattice sizes. As observed from the figure, the
proposed searching paradigm first converges to the most likely solution. Then,
decisions for re-segmentations take place. This is represented as a rapid increase in
the relative probability, rapidly reaching values close to 1. So, the relative error
decreases dramatically according to the dynamic re-configuration of the word-lattice
structure. Good re-segmentation decisions are taken such that improvements of
solutions take place very fast even for large graph sizes.




Fig. 9. Convergence time for different word-lattice sizes




[Plot: probability (logarithmic axis, 1E-126 to 1E-77) versus no. of re-segmentations
(0 to 16); series: Best, 2nd Best, g*]

Fig. 10. Improving the probability of the best segmentation pattern versus number of
re-segmentations



Summary and Conclusions 

A pattern recognition architecture has been proposed using hierarchical constraint
lattices, backtracking, and a swarm-based search strategy. This has been implemented
and found to be highly effective and to scale well with large attributed lattice
structures. If the lattice consists of a limited number of states, application of
systematic algorithms, like A*, is efficient. As the number of states increases,
however, the running time of these systematic methods grows exponentially, rendering
exhaustive optimization inapplicable. The main strengths of the proposed algorithm are
its robustness and the intelligent nature of the agents in exploring new search areas.
Another potential advantage of such a paradigm is the ability to explore new good
solutions in large graphs and dynamically adapt to changes in the hypothesized model.
The simulation study for different word-lattice graphs demonstrates that the proposed
algorithm is highly robust and very efficient in the sense of yielding fast and
high-quality solutions. The results show that the method can easily remove a large set
of specific false sub-paths with excellent run-time performance and that highly
probable sentences are computed early in the search.

[Plot: relative probability versus no. of re-segmentations; series: size 35, 53, 60
and 70 nodes]

Fig. 11. Relative probability versus no. of re-segmentations for different
word-lattice sizes



Abstract. Natural Language Processing techniques for text mining and information
retrieval are finding application in the analysis of many kinds of documentation, from
technical documentation to the World Wide Web. In particular, Functional Analysis
techniques are based on the extraction of the interactions between the entities
described in the document: these interactions are expressed as Subject-Action-Object
(SAO) triples (obtainable using a suitable syntactic parser), which represent a
concept in its most synthetic form. In this work, the techniques developed for a
functional analysis of patents and their implementation in the PAT-Analyzer tool are
presented. The same technique has been tailored and applied to the analysis of
software requirements documents. Current work towards the development of a SAO-based
Content Analysis of technical documentation is also presented.



1 Introduction 

Text Mining and Knowledge Management technologies are assuming a key role for many
organizations: in order to propose competitive products or services it is necessary to
minimize the resources dedicated to the accomplishment of repetitive tasks and to
focus on "creative" activities. Moreover, innovation is basically limited by
psychological inertia on one hand, and by lack of knowledge on the other.

Therefore, information retrieval, document classification, business intelligence,
technology forecasting, competitor monitoring etc. are nowadays crucial activities
requiring advanced tools capable of facing the dramatic paradox that arises from the
availability of a huge amount of data from a bewildering variety of sources: an
overload of information means no usable knowledge. Such a contradiction between the
breadth of the information source and the low usability of a large amount of documents
is typically met in patent search activities: monitoring competitors, checking the
novelty of an invention or looking for technical solutions in other fields of
application requires great effort even from skilled researchers. Text analysis for
constructing design representations has been approached by several authors, as in [8],
by means of techniques mainly based on statistics (i.e. counting term frequencies,
identifying specific words etc.). The same approach is followed by tools specifically
dedicated to patent analysis, such as [19, 26], but their main limit is that they
cannot distinguish the role of a component in a technical system.

The commercial availability of software capable of analyzing documents with se- 
mantic processing algorithms offers a revolutionary way to search, summarize and 
classify information. Linguistic Analysis tools allow the identification of the key 
elements of a document, by combining morphological, syntactic and semantic analy- 
ses. 

Major results can be obtained by analyzing structured documents whose format is
strictly related to the content. This characteristic is typical of several forms of
technical documentation, and allows "low-cost" Linguistic Analysis techniques and
tools to be adopted. Patents are a typical example, with distinct sections for claims,
background of the invention, description of the preferred embodiment etc.

The application of Linguistic Analysis techniques to patent documents, combined with
knowledge of the patent format, allows patent analyses and comparisons [5, 6] to be
speeded up by automatically providing a functional description of an invention. By
means of syntactic parsers we can identify the Subjects, Actions and Objects of a
sentence (SAO triads). Subjects and Objects may refer to components of the system.
Actions may refer to functions performed by and on components. We are interested in
selecting those SAOs in which the Subject is the Tool performing a function on the
Artifact referred to by the Object, and the Action is the Field that links the Tool
and the Artifact (TFA triads). The successful application of Natural Language
techniques to patent documents and the development of a specific tool (PAT-Analyzer),
described in section 2, have prompted the authors to investigate the portability of
the approach to other forms of technical documentation. Indeed, in the field of
Software Requirements Engineering, several attempts have been made in the same
direction, and section 3 surveys some of them. In this case, the aim of the analysis
of technical documentation is the discovery of possible ambiguities or incompleteness
on one hand, and the detailed comprehension of the requirements with a view to a
possible formalization on the other. In section 4 the paper presents the guidelines
along which the approach and tool developed for the analysis of patents have been
adopted and modified for the analysis of software requirements documents, as well as a
novel way of performing Content Analysis on both patents and software requirements
documents.

2 Patents Functional Analysis 

Functional Analysis is a powerful tool for conceptual design, both for problem
identification and for the generation of innovative solutions: the functional
description of a product is a description at an abstract level, so that different
design solutions can be explored by developing functional variants. Moreover,
functional analysis helps the designer to follow a systematic approach in the study of
complex systems as well, by breaking up functions into simpler subfunctions and
subdividing the problem into more manageable parts. Finally, functional analysis can
also play an important role in patent breaking activities [12]. Conversely, when
writing a new patent, a text-based functional analysis is an effective test of the
suitability of the work done.

Further advantages can be obtained by performing functional analyses with a TRIZ-based
way of thinking, i.e. representing not only useful functional relationships between
system components, but also harmful, ineffective, excessive and missing functional
relationships [16].

Functional modeling is also used at the detailed design stage: following the Suh
approach [27], the function is the desired output and the design is decomposed into
functional requirements which are mapped directly to the design parameters at any
abstraction level.

Several works have proposed comprehensive representations of functions which capture
the different aspects of the designer's intention, which is a crucial issue for
developing computer-aided conceptual design systems; the aim of these works is to
define effective ways to also represent the relationships among the functions, i.e.
decomposed-into, conditioned-by, enhanced-by and described-as relations [22].

In this context the authors are developing tools and methods to help the designer in
performing functional analyses making use of several kinds of inputs. As a first
attempt, the procedure has been optimized to define the functional diagram of a US
Patent [23, 14], since those documents must follow a comprehensive set of rules [20];
hence it is possible to tune a semantic processing algorithm such that its output can
be suitably post-processed in order to build the functional scheme of the analyzed
patent. This procedure has been implemented in a software tool, PAT-Analyzer [6],
capable of identifying the components of the patented system, performing a
hierarchical classification of those components by subdividing them into different
abstraction levels, and drawing a functional diagram of the complete system and of the
detailed subassemblies, thereby providing a graphical representation of the invention
described in a patent. By means of several types of score ranks it is possible to
highlight the core of the invention and the most peculiar components and performed
functions.

2.1 PAT-Analyzer Methodology 

The data flow is represented in Figure 1: the text analysis consists of three steps
aimed at (i) identifying the components of the invention; (ii) classifying the
identified components in terms of detail/abstraction level; (iii) identifying
positional and functional interactions between the components, both internal and
external to the system. Several types of analysis can be performed by means of the
post-processing module, in order to focus on the peculiarities of the invention. The
component identification task exploits the fact that all the components must be
uniquely numbered so that they can be identified in the illustrations [22]. A
lemmatizer and a set of filters and synonyms can be adopted in order to improve the
quality of the results.

Therefore, a list of reference denominations and alternative denominations is
extracted for each component. The subsequent analysis is dedicated to the search for
descriptive locutions (i.e. sentences containing verbs like "to form", "to constitute"
etc.) and specification expressions (like "the gripper of the pivot arm") in order to
identify subsystem/supersystem relationships, hence defining a hierarchy of
detail/abstraction levels (Figure 2, left).

Fig. 1. PAT-Analyzer data flow.



Finally, positional and functional interactions between the identified components are
determined by filtering out, from the list of subject-action-object (SAO) triads
provided by a syntactic parser, the triads containing irrelevant verbs (Figure 2,
right). Further details about the algorithm can be found in [5].
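The filtering step can be sketched as follows. This is not the PAT-Analyzer algorithm, whose details are given in [5]; the stop-list of verbs, the requirement that both subject and object be known components, and all names in the code are assumptions chosen for illustration.

```python
from typing import NamedTuple


class SAO(NamedTuple):
    subject: str
    action: str
    object: str


# Hypothetical stop-list; the verbs actually treated as irrelevant are not
# listed in the text.
IRRELEVANT_VERBS = {"be", "have", "show", "illustrate"}


def select_interactions(triples, components, irrelevant=IRRELEVANT_VERBS):
    """Keep triads with a relevant verb whose subject and object are known."""
    known = {c.lower() for c in components}
    return [
        t
        for t in triples
        if t.action.lower() not in irrelevant
        and t.subject.lower() in known
        and t.object.lower() in known
    ]
```

Given the triads ("gripper", "hold", "workpiece") and ("figure 1", "show", "gripper") with components "gripper" and "workpiece", only the first triad survives, because "show" is on the stop-list and "figure 1" is not a component.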

Fig. 2. Preliminary results obtained by the analysis of the patent US 6,097,012 [23]:
hierarchical classification of the components (left) and list of functional
interactions (right).

The whole set of extracted information can be synthetically represented by means of a
diagram according to the following set of rules:

1) each identified component of the system is represented by a rectangle labeled with
its reference number and the representative name defined in the Components Recognition
phase; each identified component or subject external to the system is represented by a
gray rectangle labeled with a representative name;

2) the detail level hierarchy is represented by nesting the components at a deeper
detail level inside the components at a more abstract level;

3) the functional interactions between the identified components are represented with
straight arrows pointing from the Tool to the Artifact, labeled with the Field;

4) the positional interactions between the identified components are represented with
dashed arrows pointing from the Tool to the Artifact, labeled with the Field.
The post-processing module allows the identification of the most relevant concepts
contained in the patent text by means of several metrics (examples of application will
be provided in the next section of the paper):

1) Detail Level Chart : on the basis of the components hierarchical classification 
performed in the text analysis phase, it is possible to assign a Detail Level (DL) 
to each TFA triad and/or to each paragraph: a DL is assigned to each component 
so that the maximum abstraction level is represented by a DL=0 and the DL of 
each subsystem is one level greater than the DL of the corresponding supersys- 
tem. The detail level of a basic sentence (TFA) is estimated as the average DL of 
Tool and Artifact; the detail level of a paragraph is the average DL of all the 
TFAs belonging to that paragraph. By analyzing how the DL runs along the description,
it is easy to identify the most relevant paragraphs and concepts of the invention
(it is worth noting that when the patent description focuses on specific details of
the proposed system, that part of the description is strictly related to the
peculiarities of the invention).

2) Components Recurrence Analysis : by counting the citations of each component
of the proposed invention (taking into account all the alternative denominations)
it is possible to determine a relevance score for the components themselves; to
improve the quality of this estimate, different weights are assigned to the
citations found in the title, abstract, independent claims, dependent claims,
summary, and description of the preferred embodiment.

3) TFA Recurrence Analysis : by means of the same technique adopted for the
Components Recurrence Analysis, a score is assigned to the functional and
positional TFA triads. Partial pairs of the triads (TF, FA, and TA) are also
counted by means of the same set of weights. A ranking of the most relevant
sub-functions performed by the invention is thus determined, again in order to
focus the attention of the user on the patent peculiarities. The partial pairs score
allows complementary information to be extracted: for example, if the FA score is
greater than the corresponding TFA score, it could mean that the same component
receives the same function from several tools, i.e. a “combination” has been used to
reinforce the effect of the action. Similarly, a TF score greater than the
corresponding TFA score could mean that a parts count reduction is attempted by
integrating functionalities in the Tool, in order to reduce costs.
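
The three metrics above can be illustrated with a minimal sketch. The component hierarchy, the section weights and all function names below are hypothetical and chosen only for illustration; the text does not give the actual values or data structures used by PAT-Analyzer.

```python
from collections import Counter

# Hypothetical component hierarchy: child -> parent (None marks the whole system).
PARENT = {
    "actuator": None,
    "pivot arm": "actuator",
    "gripper": "pivot arm",
    "spring": "actuator",
}

# Assumed section weights; the text only states that different weights are used.
SECTION_WEIGHT = {"title": 5, "abstract": 4, "independent claims": 3, "description": 1}


def detail_level(component):
    """DL = 0 at the maximum abstraction level, +1 for each step down the hierarchy."""
    level = 0
    while PARENT[component] is not None:
        component = PARENT[component]
        level += 1
    return level


def tfa_detail_level(tool, artifact):
    """DL of a Tool-Field-Artifact triad: average DL of Tool and Artifact."""
    return (detail_level(tool) + detail_level(artifact)) / 2


def paragraph_detail_level(tfas):
    """DL of a paragraph: average DL of all the triads belonging to it."""
    return sum(tfa_detail_level(tool, artifact) for tool, _, artifact in tfas) / len(tfas)


def recurrence_scores(citations):
    """citations: iterable of (component, section). Returns weighted citation counts."""
    scores = Counter()
    for component, section in citations:
        scores[component] += SECTION_WEIGHT[section]
    return scores


def pair_scores(weighted_triads):
    """weighted_triads: iterable of ((tool, field, artifact), section).

    Returns weighted counts for the full triads and for the partial pairs TF, FA, TA.
    """
    tfa, tf, fa, ta = Counter(), Counter(), Counter(), Counter()
    for (tool, field, artifact), section in weighted_triads:
        weight = SECTION_WEIGHT[section]
        tfa[(tool, field, artifact)] += weight
        tf[(tool, field)] += weight
        fa[(field, artifact)] += weight
        ta[(tool, artifact)] += weight
    return tfa, tf, fa, ta
```

With two triads such as ("wire 13", "pulls", "lever") and ("wire 14", "pulls", "lever"), the FA pair ("pulls", "lever") scores higher than either full triad, which is the situation the text reads as several tools delivering the same function to one component.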







2.2 An Example of Patent Analysis 

A typical problem encountered while monitoring the intellectual property of
competitors is the increasing number of patents being presented with either very
simple or very non-informative titles. The strategy behind such titles is that they
are unlikely to attract interest. Moreover, very often even the drawings are not very
explanatory.




Fig. 3. US Patent 6,459,855 “Actuator” [14]: first drawing of the patent (left); functional dia- 
gram and TF recurrence top scores (right). 



A tool capable of identifying the core of the invention automatically is therefore 
very useful to check the relevance of a patent with minimal efforts. 

As an example, consider the US Patent 6,459,855 “Actuator” [14], assigned to
Minolta: the patent is 35 pages long and describes several embodiment variants;
therefore understanding the real purpose of such an invention and its relevance
requires a long and careful reading. The drawings are not very meaningful, as
shown in Figure 3, left: what is the function of the wires 13 and 14?

The proposed methodology speeds up the comprehension of the system without 
the need of reading the patent text, but just combining the functional diagram with the 
drawings of the patent. In this case the list of components identified by PAT- 
Analyzer has been reduced by deleting the components not present on the main draw- 
ing. 

The resulting diagram is shown in Figure 3, right, where the most recurrent Tool- 
Field pairs have been highlighted. While in such a case the detail level run analysis 
would not provide any useful information, since the components hierarchy is abso- 
lutely flat, looking for the most cited interactions in terms of subject-function points 
directly to the invention peculiarity: the role of the wires 13 and 14, absolutely not 
understandable by means of the drawing alone, is explained by the TFAs “ambient 
temperature” - “exceeds transformation temperature of” - “wire 13/14”. These TFAs
clearly suggest that shape memory alloys have been used in order to “not pull up” - 
“engaging lever 11”, to “compensate” - “spring 15”, to “cause” - “charge lever 17” - 
“perform” - “predetermined operations”. 

It is worth noting that applying statistical analyses to pre-processed data (in this
case the identified components and their interactions) is much more effective than
simply counting term occurrences, as usually performed by standard text mining
tools.

3 NLP Techniques for Software Requirements Documents 

Requirement Engineering is the branch of software engineering which deals with
the proper definition of the features a software system should have to satisfy the
needs of its target users. Software Requirements represent the agreement, often with
a legal status, between the software developers and their customers: for this reason,
on one hand they should be comprehensible to non-technical people, on the other
hand they should precisely define the functionalities of the software to be
developed, and they should be able to drive the later stages of the software
development. They are specified in Natural Language, which is comprehensible but
may introduce ambiguity.

A well structured requirement engineering process can typically be divided into
three steps: first, general information about the problem to be solved is collected
from the customers of the software product, producing a description in natural
language; starting from this document, the requirement specification is made,
which is a detailed description of the product to be implemented; finally, a software
specification is produced to abstractly describe the software system to be
implemented. Through these steps a rich, informal and potentially ambiguous natural
language description should be converted into a formal and essential one. In
practice, in many software houses requirement engineering results in only one
document in which the system to be implemented is described, often using natural
language only, with the risk of obtaining an unsatisfactory product.

Revealing the sentences affected by ambiguity in a software requirement description
can be achieved by analyzing each sentence using NLP techniques from a lexical
or syntactical point of view: for this reason, it is proper to talk about, for example,
lexical non-ambiguity or syntactic non-ambiguity rather than non-ambiguity in
general. For instance, a sentence may be syntactically non-ambiguous but lexically
ambiguous because it contains wordings that do not have a unique meaning.

Lexical evaluation uses lexical parsers to detect and possibly correct terms or 
wordings that are ambiguous (i.e. that may have multiple meanings according to the 
context): the tools QuARS [9] and ARM [30] belong to this category. These tools can 
be seen as an aid for “writing the requirements right”, not “writing the right
requirements”: they provide structural and quality indicators on the basis of a suitable model,
defined considering the existing literature and experiences in the field of requirement 
engineering and software process assessment: the SPICE (ISO/IEC 15504) model 
[13], adopted by QuARS, provides for example lexical defect indicators for the
Testability of a requirement description in terms of vagueness (a sentence contains
words having a non-uniquely quantifiable meaning), subjectivity (a sentence refers
to personal opinions), optionality (a sentence contains one or more optional terms),
and so on. For example, the sentence “The C code shall be clearly commented” is
flagged as vague because of the vagueness indicator “clearly”. The tool points out
these defects without forcing any corrective action, leaving the user free to decide
whether or not to modify the document. The use of targeted dictionaries allows the
sentences to be analyzed taking into account the particular application domain.
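
The dictionary-based lexical check described above can be sketched as follows. This is only an illustration of the idea: the indicator lists are tiny, hypothetical examples and do not reproduce the dictionaries or the quality model actually used by QuARS or ARM.

```python
import re

# Hypothetical indicator dictionaries (assumed for illustration only).
INDICATORS = {
    "vagueness": {"clearly", "easily", "adequate", "significant"},
    "subjectivity": {"similar", "better", "worse"},
    "optionality": {"possibly", "eventually", "optionally"},
}


def lexical_defects(sentence):
    """Return (defect kind, indicator word) pairs found in a requirement sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [
        (kind, word)
        for word in words
        for kind, terms in INDICATORS.items()
        if word in terms
    ]
```

For the sentence quoted in the text, `lexical_defects("The C code shall be clearly commented")` reports the single pair `("vagueness", "clearly")`; as in the tools described, the check only points out the defect and leaves any correction to the user. A domain-specific dictionary would simply replace or extend `INDICATORS`.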

Syntactical evaluation exploits syntactical analyzers to detect sentences having dif- 
ferent interpretations, as in the case of tools LOLITA [18, 17] and Circe-Cico [1, 2]. 

LOLITA is a general purpose NLP tool that is able to identify the morphological, 
grammatical, semantic and pragmatic relations of a sentence, by representing them 
with a semantic net, in a similar way as a concept graph, which is used as a knowl- 
edge base of the system: the nodes composing the net are hierarchically connected. 
The goal is to reduce the ambiguity of a sentence using statistical techniques, in order 
to identify the most likely interpretation of a requirement. The number of extracted
parsing trees is used to compute ambiguity measures, which also allow for weights
assigned according to the construction criteria of the trees themselves.

The Circe-Cico environment provides a support for identifying, selecting and vali- 
dating NL requirements, with the integration of information from the application 
domain. The Circe architecture is constituted by five components: a semantic data- 
base containing small units called atomic requirements; the modeler components 
generate a high-level model starting from the atomic requirements, that can also be 
decomposed in order to store each part in suitable syntactic archives, used together 
with the modelers to store lexical, syntactic and pragmatic contents; the projector 
components build the actual representation and parse the user input; there is a
projector for each kind of model, which is represented using a suitable intermediate
language. A translator component produces a representation of the model according
to the language of the chosen representation (graphs, tables, and so on). The Cico
component is a projector which produces parsing trees of the input NL documents: 
based on a fuzzy algorithm, it also uses backtracking and heuristics to optimize the 
parsing tree generation, and only lexical correctness of the sentence is required. In its 
latest version [3, 4], the Circe-Cico tool allows the formalization of NL requirements 
in Abstract State Machines (ASM) or UML diagrams. Circe-Cico, like LOLITA,
requires considerable familiarity on the part of the user.

Some recent studies addressing ambiguity and interpretation problems have dealt
with a relation-based approach applied to a requirement document structured
according to the Use Case formalism [7], which provides for the definition of
entities (the Actors) interacting with each other and with the described system. In
particular, the studies presented in [10, 11] investigate methods to provide support
for Consistency and Completeness checking, that is, to detect the presence in the NL
requirement document of semantic contradictions and structural incongruities, by
extracting the relations between actors and the system.

The experience of patents functional analysis suggests the extraction of SAOs as a 
basis for an alternative method to verify consistency and completeness of requirement 
definition; a twin tool (J-RAn, Java Requirement Analyzer) of the PAT-Analyzer is 
currently being implemented with the introduction of a novel SAO-based Content 
Analysis, both described in the next section. 

4 SAO-Based Content Analysis

Content Analysis has long been recognized as a means for text analysis and
information retrieval; a milestone is the work of Siemens [24], where the basic
techniques for content analysis are explained: the use of dictionaries and glossaries
of keywords makes it possible to compute statistics of their occurrences in a text,
and to detect whether the content of a document deals with a topic of interest; the
located document sections can be tagged to improve searching operations. This kind
of text mining technique has found wide application in Communication Science
[15, 28]; its basic principles can also be found in IBM’s WebFountain [29], to cite
the most recent application to the analysis of web documents. Recurrence analysis
can be evolved into a novel SAO-based Content Analysis technique for the
investigation of the parts of a document (either a patent or a requirement document)
dealing with specific topics, as described in the following subsections.

The SAO-based analysis allows searching for “key-concepts” instead of keywords, 
therefore providing a higher efficiency to the document analysis tools. 

The proposed technique consists in using suitable dictionaries of verbs and terms
related to a topic of interest, and in assigning a score to each SAO if its subject,
verb and object belong to the adopted dictionaries. By giving different weights to
subject, verb and object, it is possible to emphasize functional relationships or
terminology according to the purpose of the analysis. These weights can vary
depending on the type of document under analysis.

The average value of the SAO scores in a sentence (or in a whole paragraph) gives a
measure of how much this sentence (or paragraph) deals with the topic of interest.
This information can be plotted in a suitable (and possibly normalized) chart showing
the score vs. sentence/paragraph number, to obtain an overview of which parts of the
entire document treat the topic of interest, and to what extent.
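
A minimal sketch of this scoring scheme is given below. It assumes that each SAO element found in its dictionary contributes its own weight to the score (one possible reading of the description), uses the weights quoted in Section 4.1 for requirement documents (Subject 3, Verb 5, Object 2), and relies on small hypothetical dictionaries for an "authentication" topic; none of the names come from J-RAn or PAT-Analyzer.

```python
# Hypothetical topic dictionaries (assumed for illustration only).
SUBJECTS = {"user", "system"}
VERBS = {"enter", "verify", "authenticate"}
OBJECTS = {"password", "credential"}

# (subject, verb, object) weights; values as quoted in Section 4.1.
WEIGHTS = (3, 5, 2)


def sao_score(sao):
    """Sum the weights of the SAO elements that belong to the topic dictionaries."""
    subject, verb, obj = sao
    w_subject, w_verb, w_object = WEIGHTS
    return (
        w_subject * (subject in SUBJECTS)
        + w_verb * (verb in VERBS)
        + w_object * (obj in OBJECTS)
    )


def unit_score(saos):
    """Average SAO score of a sentence or paragraph (0.0 if it contains no SAO)."""
    return sum(sao_score(sao) for sao in saos) / len(saos) if saos else 0.0


def content_chart(units, normalize=True):
    """Score per sentence/paragraph, optionally divided by the maximum possible score."""
    scores = [unit_score(saos) for saos in units]
    if not normalize:
        return scores
    top = sum(WEIGHTS)
    return [score / top for score in scores]
```

Plotting the list returned by `content_chart` against the sentence or paragraph index gives the kind of chart described in the text; changing `WEIGHTS` shifts the emphasis between terminology (subject, object) and functional relationships (verb).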

4.1 Requirement Analyzer Tool 

The development of the J-RAn tool started from the observation that the extraction of
SAO relations can constitute a basis of transferable techniques for the analysis of both
patents and software requirements. J-RAn is able to extract the descriptive paragraphs 
related to each software requirement, by searching each requirement tag matching a 
user-defined format, which can vary depending on the document model. The extrac- 
tion of paragraphs and corresponding sentences is implemented using Phrasys NLP 
components [21]. The SAO extraction is then performed by a syntactic parser (in the 
first version, LinkGrammar [25]). 

J-RAn inherits from PAT-Analyzer the recurrence analysis technique as a means to
detect possible inconsistencies, by performing the SAO-based Content Analysis, for
example, to check whether requirements corresponding to a given functionality are
expressed only in the desired sections of the document: if other paragraphs deal
with the functionality under discussion, the document probably contains a
redundancy, which is a source of possible inconsistencies.

Current experiments on different software requirement documents use a score
assignment in which the functional relationship is favoured (weights: Subject:3,
Verb:5, Object:2). An example of the resulting chart is reported in Figure 4.

Careful experimentation with SAO weights is expected to yield the most suitable
values also for other kinds of technical documentation (such as handbooks).

4.2 Patent Analysis 

The content analysis approach can also be followed with the purpose of enriching
the information extracted from patents. Nevertheless, in such a case the use of
standard dictionaries and glossaries is inadvisable, since it precludes the retrieval
of breakthrough solutions (that typically introduce new technologies in a given field
of application). Moreover, it is worth remembering that the relevant concepts of the
invention can be extracted by means of the techniques described in section 2.
Therefore, the proposed methodology consists in performing occurrence searches in
the patent text, by looking for the top score components and/or TFA triads and
TF/FA pairs.






Fig. 4. Content Analysis chart related to a Software Requirement document. 



It has been observed that such an approach is much more effective than traditional
terminology analysis based on stemming and statistical analysis of the extracted
lemmas. In fact, a word or even a multiword can be mentioned several times in a
patent without representing the peculiarity of the invention. (It should be mentioned
that the auxiliary verb “to be” also has, of course, a large number of occurrences,
but “general” terms are usually filtered by means of a set of stop words.) In the case
of the actuator claimed in [14], the most cited multiword is “shape memory alloy”
with 250 occurrences. A traditional text processing technique highlights that the
invention is related to the use of shape memory alloy properties, but does not
provide any information about how and why such technology has been adopted,
since it is mentioned all over the document (Figure 5).

In contrast, the PAT-Analyzer recurrence analysis identifies the “wire 13” as the most
important component of the invention together with the “engaging lever 11” (94 and
91 occurrences respectively, regardless of the weight of the part of the document
from which they are extracted). The paragraphs where “wire 13” is mentioned are
fewer than those containing the multiword “shape memory alloy” (Figure 5), thus
focusing the attention of a patent analyst on a more meaningful portion of the
invention description. Moreover, PAT-Analyzer identifies two SAOs (Figure 4) as the
most relevant: “ambient temperature - exceed - transformation temperature of wire 13”
and “ambient temperature - exceed - transformation temperature of wire 14”: by
executing a SAO-based content analysis searching for these two triads it is possible
to point directly to the paragraphs where the invention peculiarity is described.

Fig. 5. Example of Content Analysis chart related to patent [14].



5 Conclusions 

In this paper it has been shown how the use of available NL parsers, which are any- 
way not particularly sophisticated, can support functional analysis techniques on 
technical documentation, such as recurrence and content analysis. The experiments 
carried out on Patents and Requirement documents show that these techniques are 
able to effectively support the human comprehension of these classes of technical 
documentation. 

Regarding in particular requirement documents, current work deals with the im- 
plementation in J-RAn of a feature for expressing the extracted functional relation- 
ships in the XMI format, in order to export them to commercial UML tools (Rational 
Rose, Visual-Paradigm), capable of visualizing the relationships as a UML Class 
Diagram. 

As future work, SAO-based Content Analysis will be tested on other kinds of
documentation, either technical (e.g. handbooks) or not (e.g. web and public
administration documents), with a suitable tuning of weights depending on the
application.

Abstract. Question answering (QA) is the task of retrieving an answer
in response to a question by analyzing documents. Although most of the 
efforts in developing QA systems are devoted to dealing with electronic 
text, we consider it is also necessary to develop systems for document 
images. In this paper, we propose a method of document image retrieval 
for such QA systems. Since the task is not to retrieve all relevant docu- 
ments but to find the answer somewhere in documents, retrieval should 
be precision oriented. The main contribution of this paper is to propose 
a method of improving precision of document image retrieval by taking 
into account the co-occurrence of successive terms in a question. The 
indexing scheme is based on two-dimensional distributions of terms and 
the weight of co-occurrence is measured by calculating the density distributions
of terms. The proposed method was tested on 1253 pages of documents about
Major League Baseball with 20 questions and was found to be superior to the
baseline method previously proposed by the authors.



1 Introduction 

Question answering (QA) is the task of retrieving answers rather than documents
in response to a question with an emphasis on functioning in unrestricted
domains [1]. Since it enables us to realize a more natural means of “information
retrieval” as compared to the keyword-based retrieval of documents, it has attracted
a great deal of attention in recent years. Much effort has been made including
TREC conferences [2], as well as application to the Web [3]. In addition, some
research groups have started offering services of QA systems to the public [4, 5].

Question answering has been studied in the field of information retrieval and
thus most of the existing QA systems work only on electronic text. But is it
enough for us to deal only with electronic text? We consider that it is not
sufficient for at least the following two reasons. First, we already have
a huge amount of document images in various databases and digital libraries.
For example, the magazine “Comm. of the ACM” in the ACM digital library [6]
consists of 80% document images and 20% electronic documents. Another
reason is that mobile devices with digital cameras are now coming into common
use. Some users already utilize such devices for taking digital copies of
documents, because it is much more convenient than writing memos^1. This
indicates that not only legacy documents but also new documents continue to be
stored as document images.

In order to utilize such document images from the viewpoint of question
answering, we have started a project of developing a QA system called “IQAS”
(document Image Question Answering System)^2. In this paper, we propose a
method of document image retrieval for IQAS, by modifying our previous method
[7, 8]. The major contribution of this paper is a way of improving the precision of
spotting parts that include the answer to the question. The previous method, which
is called the baseline method in this paper, employs density distributions of terms
for retrieving appropriate parts of images. In this paper, new density
distributions modified by taking into account the co-occurrence of successive terms
in the question are introduced and tested by experiments on 1253 pages with 20
questions. The results of experiments show that the proposed method is superior
to the baseline method.

2 Question Answering for Document Images 

2.1 Task and Configuration 

The task of QA is precision oriented in nature. This is because the user is
satisfied not by having all documents containing the same correct answer, but by
just receiving the correct answer once. In the QA task, the user is allowed to ask
questions in natural language. Systems for electronic text developed so far have
tackled the questions of seeking simple facts by using “who”, “what”, “which”,
“when” and “where”. Questions using “why” and “how” generally require much
longer and more complicated answers and thus their processing is still an open
problem.

In order to locate facts in documents, QA systems are generally based on the 
following configuration. 

1. Query Processing : The question in natural language is analyzed to obtain 
both query terms and the type of question. Query terms are employed in the 
next step of processing. The type of question defines what the question asks 
about. For example, “location”, “time” and “person” are typical types. 

2. Document Retrieval : A document retrieval engine is employed to find
documents which are likely to contain the answer. Passage retrieval, i.e., retrieving
small portions of text from documents, is often utilized in this step.

3. Answer Extraction : The final step is to locate the answer in the retrieved 
passages with the help of types. Named entity extraction is applied to the 
extracted passages so as to locate the terms representing the answer to the 
question. 

^1 In Japan, digital shoplifting, i.e., taking pictures of books and magazines in
bookstores with mobile phones, has become an object of public concern.

^2 The pronunciation of the term IQAS is close to “Ikasu” in Japanese, which has
two meanings: “to exploit” and “cool”.

Fig. 1. System configuration.



Our system IQAS also follows the above configuration. Figure 1 illustrates 
the system configuration of IQAS. In this paper we focus only on the second 
step, which is “document image retrieval” in our case. 



2.2 Related Work 

Document image retrieval has been studied in both fields of information retrieval [9]
and document image analysis [10]. One of the central issues has been how to cope
with OCR errors. Other errors in higher level analyses such as layout analysis
and logical labeling have less influence on the retrieval results if retrieval systems
are based on the well-known “bag of words” (BOW) model. This is because only
the frequency of terms is utilized in the BOW model.

However, this does not hold for passage retrieval and question answering,
since these need to segment parts that are defined based on the results of higher
level analyses. Thus methods for dealing with errors in higher level analyses are
required. Although a straightforward way is to improve the accuracy of high level
analyses, we have taken an indirect way by proposing a different retrieval method
[7, 8], which is an extension of the original work on electronic text [11] to the two
dimensional space. The characteristic point of the method is that it relies only on 
positions of terms in original pages. Parts are segmented not on the recognized 
text but on the two dimensional space of page regions. Density distributions of 
terms in the query are employed for locating parts relevant to it. This enables 
us to retrieve parts of document images independently of the results of higher 
level analyses. 

In this paper, we improve the above method of density distributions to be 
better suited for precision oriented retrieval. 

3 Document Image Retrieval 

3.1 Overview 

The basic concept of the proposed method is to find parts of documents which
densely contain terms in a query. The processing consists of the three steps shown
in Fig. 1. Taking as input a set of index terms or a query extracted by the query
processing, filtering is first applied to select pages which are likely to contain an
answer to the question. Then the density distributions are calculated to find the
parts which densely contain terms in the query. Finally, relevant parts are found
based on the density distributions.

In the following, the details of each step are explained after a brief introduc- 
tion of indexing and query processing. 

3.2 Indexing 

The process of indexing is basically the same as in our previous method. First, 
all words and their bounding boxes are extracted from page images with the 
help of OCR. Second, stemming and stopword elimination are applied to the
extracted words^3. The resultant words are called index terms (or simply terms)
and stored with the centers of their bounding boxes. In other words, each page 
image is viewed as a two dimensional distribution of terms in it. 
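
The indexing step can be sketched as follows, assuming the OCR output is already available as a list of words with bounding boxes. The stopword list and the stemmer below are deliberately simplified stand-ins (the stemmer only handles plural endings); the text does not say which stemmer or stopword list is actually used.

```python
# Simplified stopword list (assumed for illustration only).
STOPWORDS = {"a", "an", "the", "of", "is", "in", "to", "and"}


def toy_stem(word):
    """Very small stand-in for a real stemmer: strips plural endings only."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def index_page(ocr_words):
    """ocr_words: list of (text, (x0, y0, x1, y1)) bounding boxes from OCR.

    Returns a list of (index term, (center x, center y)).
    """
    index = []
    for text, (x0, y0, x1, y1) in ocr_words:
        word = text.lower()
        if word in STOPWORDS:
            continue
        center = ((x0 + x1) / 2, (y0 + y1) / 2)
        index.append((toy_stem(word), center))
    return index
```

The result is exactly the "two dimensional distribution of terms" mentioned above: each page becomes a set of points labeled with index terms, with no dependence on layout analysis or reading order.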

A new functionality introduced to the proposed method is the normalization 
of image size. In general, page images have various layouts. Some documents such 
as newspapers and technical journals may have multi-column layouts and thus 
densely contain a lot of terms in one page. On the other hand, others may have
single-column layouts with a wider interline spacing and thus contain fewer terms.
Documents with multi-column layouts would, therefore, be unevenly promoted 
if we simply computed the density of terms. 

To avoid this harmful effect, it is necessary to normalize the size of page
images. As the normalization constant $C$, we employ $C = H_m / 5$ where $H_m$ is
the mode of textline height included in each document.

3.3 Query Processing 

The task of query processing is both to identify the type of question and
to extract index terms from the question. Suppose we have a query “Where
is the Baseball Hall of Fame?”. The query type is “location” from what it is
asking and the index terms are “baseball”, “hall” and “fame”. Note that only
the extraction of index terms is relevant to the task of document image retrieval.
In the following, the sequence of extracted index terms is called the query and
represented as $q = (q_1, \ldots, q_u)$ where $q_i$ is called a query term and $i$
indicates the order of occurrence in the question. For the above example,
$q =$ (baseball, hall, fame).

3.4 Filtering 

Filtering is applied to ease the burden of the next step which is relatively
time-consuming. The task here is to select $N$ pages that are likely to include the
answer to the query.

^3 Stemming is the process of normalizing words by keeping only word stems, e.g., from
“processes” to “process”. Stopwords are words that convey little meaning such as 
“a” and “the”. 

For this purpose we utilize the simple vector space model (VSM) [12]. In the
VSM, both a page $p_j$ and a query $q$ are represented as $m$-dimensional vectors:

$$p_j = (w_{1j}, \ldots, w_{mj})^T , \tag{1}$$

$$q = (w_{1q}, \ldots, w_{mq})^T , \tag{2}$$

where $T$ indicates the transpose, $w_{ij}$ is a weight of a term $t_i$ in a page $p_j$, and
$w_{iq}$ is a weight of a term $t_i$ in a query $q$. In this paper, we employ a standard
scheme called “tf-idf” defined as follows:

$$w_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i , \tag{3}$$

where $\mathrm{tf}_{ij}$ is the weight calculated using the term frequency $f_{ij}$ (the number of
occurrences of a term $t_i$ in a page $p_j$), and $\mathrm{idf}_i$ is the weight calculated using the
inverse of the page frequency $n_i$ (the number of pages containing a term $t_i$). In
computing $\mathrm{tf}_{ij}$ and $\mathrm{idf}_i$, the raw frequency is usually dampened by a function.
We utilize $\mathrm{tf}_{ij} = \log(f_{ij} + 1)$ and $\mathrm{idf}_i = \log(n / n_i)$ where $n$ is the total number
of pages. The weight $w_{iq}$ is similarly defined as $w_{iq} = \log(f_{iq} + 1)$ where $f_{iq}$ is
the frequency of a term $t_i$ in a query $q$.

The similarity between a page $p_j$ and a query $q$ is measured by the cosine of
the angle between $p_j$ and $q$:

$$\mathrm{sim}(p_j, q) = \frac{p_j^T q}{\|p_j\| \, \|q\|} , \tag{4}$$

where $\| \cdot \|$ is the Euclidean norm of a vector. Pages are sorted according to the
similarity and the top $N$ pages are selected and delivered to the next step.
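
The filtering step of Eqs. (1)-(4) can be sketched as below, with pages and the query given as lists of index terms. The function names and the `top_n` parameter are illustrative choices, not part of the original system.

```python
import math
from collections import Counter


def tfidf_vectors(pages, query):
    """Build page vectors (tf-idf, Eq. 3) and the query vector over the page vocabulary."""
    n = len(pages)
    page_freq = Counter(term for page in pages for term in set(page))
    vocab = sorted(page_freq)
    idf = {term: math.log(n / page_freq[term]) for term in vocab}

    page_vecs = []
    for page in pages:
        freq = Counter(page)
        page_vecs.append([math.log(freq[term] + 1) * idf[term] for term in vocab])

    q_freq = Counter(query)
    q_vec = [math.log(q_freq[term] + 1) for term in vocab]
    return page_vecs, q_vec


def cosine(a, b):
    """Cosine of the angle between two vectors (Eq. 4); 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_pages(pages, query, top_n):
    """Return the indices of the top_n pages most similar to the query."""
    page_vecs, q_vec = tfidf_vectors(pages, query)
    ranked = sorted(
        range(len(pages)),
        key=lambda j: cosine(page_vecs[j], q_vec),
        reverse=True,
    )
    return ranked[:top_n]
```

Only the selected pages are passed on to the density computation, which is why this comparatively cheap bag-of-words ranking is applied first.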



3.5 Calculation of Density Distributions 

This step is to calculate density distributions of the query $q$ for each selected
page. Density distributions of the query are defined based on those of each query
term $q_i$. Let $T_i^{(p)}(x, y)$ be a weighted distribution of a term $q_i (= t_k)$ in a
selected page $p$ defined by:

$$T_i^{(p)}(x, y) = \begin{cases} \mathrm{idf}_k & \text{if } q_i (= t_k) \text{ occurs at } (x, y), \\ 0 & \text{otherwise} , \end{cases} \tag{5}$$

where $(x, y)$ is the center of the bounding box of a term. A density distribution
$D_i^{(p)}(x, y)$ is a weighted distribution of $q_i$ smoothed by a window $W(x, y)$:

$$D_i^{(p)}(x, y) = \sum_{u=-M_x/2}^{M_x/2} \; \sum_{v=-M_y/2}^{M_y/2} W(x - u, y - v) \, T_i^{(p)}(u, v) . \tag{6}$$

As a window function, we utilize a pyramidal function with the window widths
$M_x$ (the horizontal width) and $M_y$ (the vertical width) shown in Fig. 2.

Fig. 2. Window function.
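
The smoothing of Eqs. (5) and (6) can be sketched for a single query term as follows. Because the weighted distribution is zero except at term positions, the sum reduces to one window value per occurrence. The exact shape of the pyramidal window is given only in the figure, so the formula used here (1 at the center, falling linearly to 0 at the window border) is an assumption.

```python
def pyramid(dx, dy, mx, my):
    """Assumed pyramidal window of widths mx (horizontal) and my (vertical)."""
    return max(0.0, 1.0 - max(abs(dx) / (mx / 2), abs(dy) / (my / 2)))


def density(x, y, occurrences, idf, mx, my):
    """Density of one query term at (x, y).

    occurrences: list of (u, v) bounding-box centers where the term occurs;
    idf: the idf weight of the term.
    """
    return sum(idf * pyramid(x - u, y - v, mx, my) for u, v in occurrences)
```

For a term with idf 2.0 occurring at (10, 10) and a window of widths 8 and 4, the density is 2.0 at the occurrence itself, 1.0 two units to the right, and 0.0 at the window border four units away.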



As discussed in 2.1, document image retrieval for the QA task should be
precision oriented. An easy way of making retrieval precision oriented is to find
parts which densely contain all the query terms. This is achieved by the point-wise
multiplication of corresponding density distributions:

$$C_u^{(p)}(x, y) = \prod_{i=1}^{u} D_i^{(p)}(x, y) , \tag{7}$$

where $u$ is the number of query terms.

However, this causes a problem in many cases because it is relatively rare that
all the query terms co-occur within a small region defined by the window function.
In other words, $C_u^{(p)}(x, y)$ is zero if at least one of the density distributions
$D_i^{(p)}(x, y)$ has the value of zero.

A way to avoid this undesirable situation is to relax the requirement. In this
paper, we consider smaller numbers of successive query terms. For example,
the density distribution obtained by $u - 1$ successive query terms is defined by

$$C_{u-1}^{(p)}(x, y) = \prod_{i=1}^{u-1} D_i^{(p)}(x, y) + \prod_{i=2}^{u} D_i^{(p)}(x, y) . \tag{8}$$

The reason for taking account of only the successive terms is that they are
more relevant than randomly selected ones. For the general case of
$k$ successive query terms, the density distribution is defined by

$$C_k^{(p)}(x, y) = \sum_{j=0}^{u-k} \; \prod_{i=j+1}^{j+k} D_i^{(p)}(x, y) . \tag{9}$$

In the proposed method, the density distribution of the whole query for a
page $p$ is defined as the weighted sum of the combinations from all the $u$ terms
down to $s$ successive terms:

$$D^{(p)}(x, y) = \sum_{k=s}^{u} \alpha_k \, C_k^{(p)}(x, y) , \tag{10}$$

where the parameter $s$ and the weight $\alpha_k$ are experimentally determined.

Fig. 3. Graphical user interface.
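To make Eqs. (9) and (10) concrete, the sketch below evaluates them at a single point $(x, y)$, given the values of the per-term densities at that point. The function names and the sample values are illustrative assumptions.

```python
from math import prod

def successive_product(densities, k):
    """C_k at one point (Eq. 9): sum, over every run of k successive terms,
    of the product of their densities."""
    u = len(densities)
    return sum(prod(densities[j:j + k]) for j in range(u - k + 1))

def query_density(densities, s, alpha):
    """D at one point (Eq. 10): weighted sum of C_k for k = s, ..., u."""
    u = len(densities)
    return sum(alpha(k) * successive_product(densities, k) for k in range(s, u + 1))

point = [0.5, 2.0, 0.0]              # D_1, D_2, D_3 at one (x, y); hypothetical values
c3 = successive_product(point, 3)    # 0.0: the third term is absent near this point
c2 = successive_product(point, 2)    # 0.5 * 2.0 + 2.0 * 0.0 = 1.0
d = query_density(point, s=2, alpha=lambda k: k)   # 2 * 1.0 + 3 * 0.0 = 2.0
```

In this example the product over all three terms vanishes, while the relaxed combination of two successive terms still yields a non-zero density.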



3.6 Spotting Relevant Parts 

Based on the density distribution of Eq. (10), parts which are likely to include 
the answer are located on page images. First, page images are ranked according 
to their score of the maximum density: 

$$s^{(p)} = \max_{x, y} D^{(p)}(x, y) . \tag{11}$$

Then, the top-ranked page is presented to the user through the GUI shown in 
Fig. 3. In this figure, the part with high density is highlighted. The user can 
magnify the retrieved part in the page. If it does not contain the answer, the 
user can retrieve the next page. 
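The ranking by maximum density could be sketched as follows, assuming each page's density distribution is stored as a grid indexed as `grid[y][x]`. The page identifiers and values are hypothetical.

```python
def peak(grid):
    """Return (maximum value, (x, y) of the maximum) for a density grid stored as grid[y][x]."""
    best_value, best_xy = float("-inf"), None
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_value, best_xy

def rank_pages(page_grids):
    """Sort page ids by their maximum density (Eq. 11), highest first."""
    return sorted(page_grids, key=lambda pid: peak(page_grids[pid])[0], reverse=True)

grids = {
    "p1": [[0.0, 0.2], [0.1, 0.0]],
    "p2": [[0.0, 0.0], [0.9, 0.3]],
}
order = rank_pages(grids)
```

The position returned by `peak` is the point around which the part to be highlighted would be centred.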

4 Experimental Results 

4.1 Data and Parameters 

The proposed method of document image retrieval was tested using PDF documents
about Major League Baseball. The number of documents and the
total number of pages are 197 and 1253, respectively. We employed PDF documents
because they are the easiest way for us to obtain terms and their coordinates
with no OCR errors. We consider such clean data appropriate for
a first evaluation of the method.

For the above documents, we prepared the queries shown in Table 1. Some 
queries are associated with several possible answers delimited by commas; we 
regarded an output of the method as correct if at least one of them is included. 
Some answers consist of several terms like “setup man” for the query 10. In
such cases, an output must include all of them to be regarded as correct. The 
parentheses in Table 1 indicate stopwords in the answers; these were not checked 
for marking. 

Table 1. Queries.

Id  Query                                                    Answer
 1  What is the oldest stadium in Japan?                     Koshien
 2  Who is Godzilla?                                         Matsui
 3  Who is the American League Leader in hits?               Ichiro
 4  Who is the American League Leader in batting average?    Ichiro
 5  Who is BRET BOONE?                                       (All-)star (second) baseman
 6  From what are baseball gloves made?                      cowhide
 7  From what are baseball bats made?                        wood
 8  What variations are thrown in the major league?          seam fast ball, changeup, curveball, slider, split finger, forkball, knuckleball
 9  Which team uses Koshien as home?                         Hanshin
10  Who is Shigetoshi Hasegawa?                              setup man
11  Where was Ichiro Suzuki born?                            Japan, Tokyo, Honshu
12  Where is the Baseball Hall of Fame?                      New York
13  Who is the world’s best-known athletes?                  Sosa, Jeter, Piazza, Rordriguez
14  Who is the most dominant and visible athlete in Japan?   Ichiro
15  Which stadium known as the House that Ruth Built?        Yankees
16  What is First Aid Kit Rule?                              first aid kit
17  What team does Mark McGuire play for?                    Cardinals
18  What team did Babe Ruth play for?                        New York Yankees
19  What record is Mark McGwire close to breaking?           (the most) homeruns (in one) season
20  Which is the most famous stadium in Japan?               Koshien

Table 2 lists the ranges of parameters tested in the experiments. As the
unit of length for the window size, 1/5 of the mode of textline height in each
document is utilized. In the experiments, the window size varied from the height
of 3.6 (= 18/5) textlines to 20 (= 100/5) textlines. The value of $s$ indicates the
minimum number of combined terms. Thus, if a query includes five terms and
$s = 2$ is applied, the successive combinations of two terms up to five terms are
considered in calculating the density distributions. As for the value of $\alpha_k$, we
tested “1” (equal weight) and “$k$” (varied weight). Since $k$ corresponds to the
number of combined terms, combinations with a larger number of terms are
more important in the case of $\alpha_k = k$. The number of pages $N_v$ selected at the
filtering was fixed to 10 throughout the experiments.



4.2 Evaluation and Experiments 

The output of the method is the ranked pages with their density distributions.
We regarded a page as correct if a correct answer listed in Table 1 is found
among the $N_t$ terms nearest to the peak of the density distribution in the page. For
each query, the top five pages obtained by the method are checked for
correctness.

Table 2. Parameters and their ranges.

Parameter                                 Symbol       Range
width of the window                       $M_x$        18 ~ 100, step 4 (†)
height of the window                      $M_y$        18 ~ 100, step 4 (†)
min. no. of terms combined in Eq. (10)    $s$          1 ~ total no. of terms in a query
the weight for $C_k^{(p)}(x, y)$          $\alpha_k$   = 1 or = $k$

(†) measured in units of 1/5 of the mode of textline height.



The results were evaluated using the score called “mean reciprocal rank” 
(MRR) defined as the average of “reciprocal ranks” for all queries. The reciprocal 
rank of a query is calculated as $1/r$ where $r$ is the rank of the page which first
contains the correct answer. For example, if the third-ranked page first contains 
the correct answer, the reciprocal rank is 1/3. 
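A small sketch of the MRR computation follows. It assumes that each query comes with an ordered list of flags marking which of its ranked pages are correct, and that a query whose answer is not found contributes a reciprocal rank of 0. The function names and the three toy queries are hypothetical.

```python
from fractions import Fraction

def reciprocal_rank(correct_flags):
    """1/r for the first correct page in a ranked list, or 0 if none is correct."""
    for rank, is_correct in enumerate(correct_flags, start=1):
        if is_correct:
            return Fraction(1, rank)
    return Fraction(0)

def mean_reciprocal_rank(per_query_flags):
    """Average of the reciprocal ranks over all queries."""
    ranks = [reciprocal_rank(flags) for flags in per_query_flags]
    return sum(ranks) / len(ranks)

# Three hypothetical queries, each with its top five pages marked correct or not.
flags = [
    [True, False, False, False, False],   # first page correct  -> 1
    [False, False, True, False, False],   # third page correct  -> 1/3
    [False, False, False, False, False],  # answer not found    -> 0
]
mrr = mean_reciprocal_rank(flags)  # (1 + 1/3 + 0) / 3 = 4/9
```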

Experiments were carried out based on leave-one-out cross validation,
i.e., values of parameters were selected by training on all queries but the
left-out one, and the selected values were applied to the processing of the left-out
query as a test. MRR was obtained by leaving out every query in turn and averaging
the resultant reciprocal ranks.
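The protocol can be sketched as follows, assuming the reciprocal rank of every query has been precomputed for every candidate parameter setting. The table `rr`, the setting labels and the tie-breaking rule are assumptions made for illustration.

```python
def leave_one_out_mrr(rr):
    """rr[setting][query] is the reciprocal rank of `query` under parameter `setting`.

    For each left-out query, choose the setting with the best mean reciprocal
    rank on the remaining queries, then score the left-out query with it.
    Ties go to the first setting listed, which is an arbitrary choice.
    """
    settings = list(rr)
    queries = list(next(iter(rr.values())))
    test_scores = []
    for held_out in queries:
        train = [q for q in queries if q != held_out]
        best = max(settings, key=lambda s: sum(rr[s][q] for q in train) / len(train))
        test_scores.append(rr[best][held_out])
    return sum(test_scores) / len(test_scores)

# Two hypothetical parameter settings evaluated on three hypothetical queries.
rr = {
    "A": {"q1": 1.0, "q2": 0.5, "q3": 0.25},
    "B": {"q1": 0.0, "q2": 1.0, "q3": 1.0},
}
score = leave_one_out_mrr(rr)
```

Here setting B is chosen when q1 is left out and setting A otherwise, so the test scores are 0, 0.5 and 0.25.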

For the purpose of comparison, we applied the simplest variant of our previous 
method [7, 8] as the baseline. In this method, density distributions are calculated 
based not on Eq. (10) but on the following:

$$D^{(p)}(x, y) = \sum_{i=1}^{u} D_i^{(p)}(x, y) . \tag{12}$$

Except for this difference, all processing steps are shared with the proposed 
method. 

4.3 Results and Discussion 

Let us first show the results of training. Table 3 shows MRR obtained through 
the training. As the number of nearest terms Nt, which is related to the accuracy 
of results, we utilized 30 and 10; the task is harder with a smaller Nt. As shown 
in this table, the proposed method outperformed the baseline method for both 
values of Nt. 



Table 3. MRR and values of parameters obtained by training.

                               $M_x$            $M_y$
method     Nt    MRR      ave.    mode     ave.    mode     $s$    $\alpha_k$
proposed   30    0.626    67.6    70       75.8    78       2      $k$
proposed   10    0.579    58      58       77.6    78       2      $k$
baseline   30    0.503    42.4    38       38      38       -      -
baseline   10    0.490    33.8    30       33.4    30       -      -

Fig. 4. Examples of density distributions. Panels: proposed method (Nt = 10); baseline method (Nt = 10).



Values of parameters selected at the training are also listed in Table 3. Let
us next discuss the values of $s$ and $\alpha_k$, both of which are only for the proposed
method. The proposed method often performed best with $s = 2$ and $\alpha_k = k$ for
both values of $N_t$. The value $s = 2$ indicates that it is better not to take into account
the case of $s = 1$, i.e., the distributions of single terms. As stated in Sect. 3.5,
$s$ indicates the requirement of “co-occurrence” of successive terms within the
window region. The baseline method, on the other hand, calculates density
distributions by taking into account only the case of single terms (see Eq. (12)).
Thus the results indicate that co-occurrence plays an important role in
locating the answer accurately. The selection of $\alpha_k = k$ means that
“co-occurrence” of a larger number of terms is more important than that of
fewer terms.

Let us turn to the window widths. Table 3 shows that (1) smaller windows
are required for smaller $N_t$ for locating the answers more accurately, and (2) the
baseline method requires smaller windows as compared to the proposed method.
Because smaller windows provide less smoothing of the distributions,
they are not desirable from the viewpoint of stability of the processing.
For example, the baseline method with $N_t = 10$ uses a window of size 30 × 30.
Since the typical height of body textlines is normalized to 5, the window is
6 textlines in size. On the other hand, the proposed method employs windows of size
12 to 16 textlines.

Examples of density distributions are illustrated in Fig. 4. The baseline
method yielded some spikes in the distributions. On the other hand, the proposed
method generated smooth distributions. In general, larger windows allow
us to obtain smoothness, though they spoil the accuracy of locating the answers.
The proposed method avoids this side effect by using the combinations of terms.

Table 4. Results of test.

method     Nt    MRR
proposed   30    0.575
proposed   10    0.450
baseline   30    0.298
baseline   10    0.379

[The per-query reciprocal ranks (query Id 1 to 20) of the original table are not legible in this copy.]



Table 4 shows the results of test for the left-out queries. In this table, “query 
Id” indicates the left-out query and the numbers for them represent the reciprocal 
ranks. 

For the queries 8, 12, 13, 15, and 17-19, neither of the methods could find the 
answers within top five pages. This was partly due to repetitious use of general 
query terms in pages. For example, the query 8 includes the terms “variation”, 
“throw” , “major” and “league” all of which are commonly used in documents 
on the major league baseball. Another and more important reason is that the 
methods are without the “filtering” capability based on the type of queries. For 
instance, the query 12 asks the location but only one page among all top five 
pages (in total 20 pages) included the name of location. Filtering would allow 
us to solve the problem as in the systems for electronic text. 

For the queries 2, 3 and 14, both of the methods found the answers in the
top ranked page. The difference between the methods therefore comes from the remaining queries.

For the query 20, the proposed method was inferior to the baseline method. 
This was caused by the erroneous selection of the value of s in the proposed 
method. In this case, $s = 3$ was selected by the training. Since the query includes
three terms, the selection indicates the requirement of co-occurrence of all terms.
But unfortunately, no page included all of them within the size of the window.

For the queries 5, 6, 9, 11 and 16, on the other hand, the proposed method 
outperformed the baseline method. In most of these cases, the baseline method 
yielded erroneous results due to the repetitious use of some terms in the query. 
For example, a page including the term “glove” frequently was erroneously 
ranked at the top by the baseline method, though the query also includes the 
terms “baseball” and “made”. For these cases, therefore, the proposed method,
which puts additional weight on the co-occurrence of terms, was successful.

In total, the values of MRR show that the proposed method outperformed 
the baseline method for both values of Nt. 



5 Conclusion 

In this paper we have presented a method of document image retrieval that 
is modified to be precision oriented for the task of question answering. The 
characteristic point of the method is that it takes combinations of successive 
query terms into account when calculating density distributions. This allows
us to improve the accuracy of locating answers without increasing the spurious
spikes in the distributions. From the experimental results we confirmed that the 
proposed method outperformed the baseline method. 

Future work includes experiments with a larger number of queries and doc- 
uments, as well as with OCR’ed documents. The implementation of the whole 
system with the capabilities of “query type identification” and “answer extraction”
is also important future work.



Abstract. Braille is the most effective means of written communication between
visually-impaired and sighted people. This paper describes a new system
that recognizes Braille characters in scanned Braille document pages. Unlike 
most other approaches, an inexpensive flatbed scanner is used and the system 
requires minimal interaction with the user. A unique feature of this system is the 
use of context at different levels (from the pre-processing of the image through 
to the post-processing of the recognition results) to enhance robustness and, 
consequently, recognition results. Braille dots composing characters are identi- 
fied on both single and double-sided documents of average quality with over 
99% accuracy, while Braille characters are also correctly recognised in over 
99% of documents of average quality (in both single and double-sided docu- 
ments). 



1 Introduction 

Information in written form plays an undeniably important role in our daily lives. 
From education and leisure, to casual note taking and information exchange, recording 
and using information encoded in symbolic form is essential. Visually impaired (blind 
and partially sighted) people face a distinct disadvantage in this respect. Addressing 
this need, the most widely adopted writing convention among visually impaired people 
is Braille. Since its inception in 1829, significant developments have taken place in the 
production of Braille and Braille media as well as in the transcription of printed mate- 
rial into Braille. 

However, although the production of Braille documents is relatively easy now, the 
problem of converting Braille documents into a computer-readable form still exists. 
This is a significant problem for two main reasons. First, there is a wealth of books 
and documents that only exist in Braille that, as with other rare/old documents, are 
deteriorating and must be preserved (digitized). Second, there is an everyday need for 
duplicating (the equivalent of photocopying) Braille documents and for translating 
Braille documents for use by non-Braille users. The latter application is quite impor- 
tant, as it forms the basis for written communication between visually impaired and 
sighted people (e.g., a blind student submitting an assignment in Braille). 

The automated recognition of Braille documents is not straightforward due to the 
special characteristics of the documents themselves (see below) and the constraints of
the application domain. More specifically, in addition to the natural expectations for
high efficiency demanded from a document conversion application, a Braille recogni- 
tion system must also be easy to use by visually impaired people (it should not require 
complicated setup etc.) and cost-effective (should only use commercially available 
standard equipment). 

In order to appreciate the idiosyncrasies of the problem and to set the scene for the 
subsequent description of the system, a brief description of Braille and of automatic 
Braille recognition systems is given below. For a comprehensive account on the sub- 
ject, the reader is referred to another publication [1] by the first author. 

Braille is a particular system of representing information in tactile form. As such, 
Braille documents are formed by groups of protruding “dots” representing characters 
and various symbols (including music). Each Braille character is an arrangement of 
six points in two columns of three (the Braille cell) as can be seen in Fig. 1 (there is 
also an 8-point representation but, for brevity, it is not covered here). Each point can 
be either raised (a protrusion, or Braille dot) or flat. The height and diameter of protru- 
sions as well as the distances between dots in the same and adjacent Braille characters 
are standardized - e.g., by the UK’s Royal National Institute for the Blind (RNIB) and 
by the Library of Congress in the US. 

The meaning of each Braille character depends on the type of Braille encoding 
used. In Grade 1 Braille, for instance, there is a one-to-one correspondence between 
printed characters and their Braille representation (with the exception of some punc- 
tuation marks). In Grade 2 Braille (also called contracted, or literary Braille) there are 
conventions for representing whole strings of printed characters by a single Braille 
one. A particular Braille character may correspond to more than one string of printed 
text and this association is only resolved by examining the context of the character 
(e.g., whether the character is on its own, or in the beginning, middle or end of a 
word). There are, of course, different conventions for Braille in different countries 
(e.g., English, American, French, etc. Braille) and domains (e.g., music, mathematics, 
computer code etc.) 



(a)  1 ○ ○ 4        (b)  ● ○   ● ●
     2 ○ ○ 5             ○ ●   ○ ●
     3 ○ ○ 6             ● ○   ● ○

Fig. 1. (a) The Braille cell, (b) The word “ON” in Grade 1 English Braille (filled dots represent
protrusions).
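As an illustration of the cell layout, the sketch below numbers the six positions as in Fig. 1(a) and decodes the two cells of Fig. 1(b). Only the two letters of the example are included. The names `LETTERS`, `cell_from_rows` and `decode` are hypothetical, and a real Grade 1 table would cover the full character set.

```python
# Letters used in Fig. 1(b), keyed by the set of raised dot positions.
# Positions follow Fig. 1(a): 1-3 down the left column, 4-6 down the right.
LETTERS = {
    frozenset({1, 3, 5}): "O",
    frozenset({1, 3, 4, 5}): "N",
}

def cell_from_rows(rows):
    """Convert three rows of (left, right) booleans into a set of dot numbers."""
    dots = set()
    for r, (left, right) in enumerate(rows):
        if left:
            dots.add(r + 1)   # positions 1, 2, 3
        if right:
            dots.add(r + 4)   # positions 4, 5, 6
    return frozenset(dots)

def decode(cells):
    """Translate a sequence of cells, writing '?' for patterns not in the table."""
    return "".join(LETTERS.get(cell_from_rows(c), "?") for c in cells)

o_cell = [(True, False), (False, True), (True, False)]
n_cell = [(True, True), (False, True), (True, False)]
word = decode([o_cell, n_cell])
```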

The nature of Braille has direct implications on the physical characteristics of docu- 
ments. The thickness of the page material (most commonly card) and the added 
thickness introduced by the protrusions result in very bulky documents, in comparison 
to printed documents containing the same information (a dictionary can occupy a 
whole bookcase). One attempt to make Braille documents more concise is the intro- 
duction of contracted Braille, as explained above. Another attempt, at the physical 
level, is the introduction of double-sided Braille documents. These documents, also 
referred to as inter-point Braille, have dots (protrusions) on both sides of the page. The
implication is that on a given page of inter-point Braille there exist protrusions (the
dots of the recto) and depressions (the dots of the verso). This fact impacts on the 
difficulty of the recognition problem as will become evident in the discussions in the 
remainder of this paper. Most bound Braille documents (e.g. books) encountered will 
be double-sided, while most personal documents produced with a Braille typewriter or 
printer will be single-sided. 

An obvious, perhaps, but significant characteristic of Braille documents is the ab- 
sence of any information visible in a colour contrasting the background. The only 
information recorded on a Braille page is in terms of the protrusions created by em- 
bossing the card (under uniform illumination a Braille document page appears blank). 
The fact that Braille documents are not intended to convey any visual information also 
has repercussions on the quality of card used to produce them. It is not uncommon for 
the card to be of low (visual) quality, with visible grain and imperfections (dark and 
light regions). This fact can affect the recognition of Braille documents by visual 
means (the objective of any realistic automated conversion system). 

There have been a number of attempts to recognize Braille documents using setups
of a camera and oblique lighting (in order to reveal shadows from the protrusions) that are
relatively complex for a visually impaired person [2][3][4]. In addition to the
non-standard setup and equipment (potentially expensive and difficult to use), the 
images obtained frequently suffer from the problems of camera-based image acquisi- 
tion (e.g., aberrations, irregular lightness, relatively low resolution etc.). 

In contrast, the use of a commercially available flatbed scanner facilitates image 
acquisition and it is a cost-effective solution. The only crucial requirement is that the 
illumination of the Braille document is non-uniform, i.e. there is only one light source 
in the scanner and at a slight distance from the image sensor (CCD). While scanner 
output quality has increased dramatically in the last decade and the price of top-quality 
scanners has dropped equally dramatically, manufacturers have not eliminated the 
“problem” of non-uniform (sideways) illumination. In fact, there is an abundance of 
very low priced scanners that “suffer” from non-uniform illumination. It is very fortu- 
nate that precisely this kind of scanner fulfils both the cost effectiveness and suitability 
criteria for optical Braille recognition. 

One of the first approaches to use a flatbed scanner to appear in the literature is that 
of Ritchings et al. [6]. It is applied to both single and double-sided Braille documents, 
scanned at 100dpi at 16 greylevels (for economic reasons at the time). It performs few 
image-based operations and is relatively tolerant of skew, as it identifies Braille
characters based on character-region search. Results reported for double-sided Braille
documents were around 96.5% correct (recognition of recto Braille characters). 

The approach of Mennens et al. [5], also using a flatbed scanner, appeared in paral- 
lel in the literature. Higher quality images are obtained (200dpi at 256 greylevels) and 
dots are located using image-based operations (performing correlation with a particu- 
lar mask). To identify Braille characters, a grid is placed on the image where Braille 
dots are expected to be. Results reported are good (99.75% correct Braille character 
recognition) on documents without major defects but the approach fails when
distortions are present (due to the fixed grid).

The new approach described in this paper is a radical re-development of the
approach of Ritchings et al. [6]. There are several improvements on both approaches
mentioned above, while the cost effectiveness and usability of the system (using stan- 
dard equipment) remains at least as high. A major advantage of the new system is the 
robustness it achieves at each stage of the process, even with poor quality documents. 
Image-based operations are drastically reduced to a minimum (only an initial thresh- 
olding and labeling) and a flexible grid is used to recover from potential errors. Most 
importantly, it is the first system to perform contextual post-processing at the word 
level using knowledge of likely specific Braille production errors. 

An overview of the new system is given in the next section, where the stages in- 
volved are described in more detail, each in a separate subsection. The paper con- 
cludes with presentation of results and discussion. 

2 The System 

The new system is applied to both single-sided and inter-point (double-sided) Braille 
documents. In the case of double-sided documents, the Braille characters on both sides 
are recognized from the image of only one side of the page. The system comprises the 
following stages. 

First, an image of a Braille document page is obtained using a flatbed scanner. The 
image is thresholded so that only three classes of regions exist: dark, light and back- 
ground. Having labeled each of the different types of regions, an initial identification 
of Braille dots is performed. A flexible grid of possible dot locations is then
constructed and any dots that were not previously detected are recovered. Braille
characters are subsequently recognized and, if the type of Braille is known (e.g., Grade 1),
they are translated into the equivalent printed text. Finally, using this interpretation, a 
suitable dictionary and awareness of common Braille errors, post-processing is per- 
formed to correct wrongly recognized Braille characters. 

Each stage is described below in more detail. 

2.1 Scanned Braille Characteristics 

In most low-cost scanners, the document page is illuminated from an offset angle. The 
direct implication for Braille documents is that the illumination of a protrusion in that 
page will not be even. The face of the protrusion, which is angled towards the light 
source, will be more brightly lit and the face of the protrusion angled away from the 
light source will be considerably less brightly lit. It is this property that can be ex- 
ploited to enable the recognition of Braille documents. 

The scanned Braille page appears with a mid-gray background, and for each protru- 
sion and depression a highlight and shadow pair is present along the scanning direction 
(depressions are only present in double-sided documents). The order in which the 
shadow and highlight appear for each dot depends upon the model of scanner in- 
volved. Some models represent protrusions as shadow areas over highlight areas while 
other scanners produce the reverse. The scanner used with this system produces the
former pattern, and the system can be reconfigured to work with other scanners.
An example of a typical scanned double-sided Braille document can
be seen in Fig. 2.




Fig. 2. An example of a scanned double-sided Braille document. 

Experiments [5] have shown, and the authors have verified, that a scanning
resolution of between 80dpi and 200dpi is the most appropriate for Braille documents.
The system described here has been tested with images in the full range; however, the
majority of the examples presented are 100dpi images.

The colour depth chosen for the images input to the system is 8-bit (greyscale). In 
actual fact the original scanned documents are in 24-bit colour. This can be useful to 
segment (in a potential pre-processing step) some paper defects, annotations or stamps 
introduced by archivists/librarians, for instance, if present. 

2.2 Pre-processing 

Since there are only three classes of useful information (shadows, light areas and 
background), a preprocessing step to reduce the greylevels in the image (to 3) is nec- 
essary. To cope with significant (in many cases) variations in lightness across the 
whole image, a local adaptive thresholding method was introduced. 

The method works by dividing the image into 32x32 pixel regions (the window size 
is experimentally derived) and assesses whether each region contains whole dots, 
highlight(s) only, shadow(s) only, or just background. This assessment is based on a 
comparison of sets of ranges of greylevels observed in the region against equivalent 
ranges that are expected when a particular feature (dot, highlight or shadow) is present 
or not. In each of those four different cases, a different threshold (or a fixed value in 
the case of background regions) is applied to the pixels of the region. 

The resulting image will have only black regions (corresponding to the shadows), 
white regions (corresponding to the highlights) and mid-grey (the majority, corre- 
sponding to the background). An example of a region of the image after this stage is 
shown in Fig. 4(b). 
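The block-wise reduction to three grey levels might look like the sketch below. The greylevel ranges actually used to classify a region are not given in the text, so the rule applied here (an expected background level, a margin, and thresholds halfway between the background and the block extremes) is purely an assumption. The image is a plain list of rows of 0-255 integers.

```python
BLACK, GREY, WHITE = 0, 128, 255

def three_level(image, block=32, background=128, margin=30):
    """Reduce a greyscale image to shadow (BLACK), background (GREY) and highlight (WHITE).

    Hypothetical rule: a block contains a shadow if its darkest pixel is more
    than `margin` below the expected background level, and a highlight if its
    brightest pixel is more than `margin` above it. Thresholds are then placed
    halfway between the background level and the block extremes.
    """
    height, width = len(image), len(image[0])
    out = [[GREY] * width for _ in range(height)]
    for y0 in range(0, height, block):
        for x0 in range(0, width, block):
            ys = range(y0, min(y0 + block, height))
            xs = range(x0, min(x0 + block, width))
            values = [image[y][x] for y in ys for x in xs]
            lo, hi = min(values), max(values)
            has_shadow = lo < background - margin
            has_highlight = hi > background + margin
            if not (has_shadow or has_highlight):
                continue  # background-only block keeps the fixed mid-grey value
            low_t = (lo + background) / 2 if has_shadow else None
            high_t = (hi + background) / 2 if has_highlight else None
            for y in ys:
                for x in xs:
                    v = image[y][x]
                    if low_t is not None and v <= low_t:
                        out[y][x] = BLACK
                    elif high_t is not None and v >= high_t:
                        out[y][x] = WHITE
    return out

# A 4 x 4 toy image processed in 2 x 2 blocks; only the top-right block holds a dot.
image = [
    [128, 128,  40, 130],
    [125, 131, 128, 220],
    [128, 128, 128, 128],
    [128, 128, 128, 128],
]
result = three_level(image, block=2)
```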

2.3 Initial Braille Dot Location 

Braille dots manifest themselves in the image as white/black region pairs. The region 
order depends on whether the dot is a protrusion (recto) or a depression (protrusion on 
verso). Given a particular scanner model, we can define for the purpose of clarity in
this paper that a protrusion will be represented in the image as a black region over a
white one (and a depression will be represented with the two regions in reverse order). 

The black and the white regions are not usually connected; they are separated by 
background pixels. However, it is not uncommon for regions of the same type to be 
connected. This situation happens only in double-sided documents, in which case the 
white region, for instance, of a protrusion is merged with the white region of a depres- 
sion. This is one of the major problems encountered when attempting to identify dot- 
forming pairs (see below). 

In the interest of efficiency, the image regions are labeled next and the subsequent 
processes in the system are carried on the connected component representation rather 
than on the image. There now exist two lists of connected components (bounding box 
co-ordinates), one for the black regions and one for the white regions. 

Braille dots can be located by identifying vertically adjacent pairs of white/black 
regions. In double-sided documents, dots on both sides are identified at the same time. 
It should be pointed out that in the latter case it is not sufficient to locate a protrusion, 
say, by simply identifying a white region that lies below a black one. Due to the white 
regions and the black regions of protrusions and depressions being frequently merged, 
the dot locating method must be careful not to discard any black and white regions that 
have been used to describe a protrusion but that can also be used to describe a
depression. An example of protrusions and depressions with merged black/white regions can
be seen in Fig. 3.





(a) (b) 

Fig. 3. (a) Example of merged components from depressions and protrusions, (b) with the corre- 
sponding dots (depressions/protrusions) outlined. 

The algorithm to identify protrusions and depressions proceeds as follows. Each of 
the black connected components is analyzed in turn. If a white region exists above it 
within the expected limits (the expected size of the Braille dot in the given resolution), 
a depression has been found. Similarly, if a white region exists below it within those 
same limits, a protrusion has been found. Components that have been used for the 
location of a dot are marked as used and should not be considered again, unless their 
width is greater than the expected width of a Braille dot (at the given resolution). In 
that case, the wide component is split into two and only the part contributing to the dot 
will be marked as used, leaving the remainder to be used in the creation of the 
neighbouring dot (from the other side of the document). 

The algorithm is cautious at this stage and only up to two dots on opposite sides of 
the document (a depression and a protrusion) will be identified, even in complex situa- 
tions where combinations of more than two dots exist (e.g., the case in Fig. 3). Addi- 
tional dots will be recovered with high confidence once topological information is also 
known (see below).

Upon completion of the algorithm two lists have been created, one storing the
locations of protrusions and the other the locations of depressions (as the coordinates of a
single point representing the top-left extremity of the upper - white or black -
component). An example of located dots (protrusions in this case) is shown in Fig. 4(c).

2.4 Grid Formation 

By definition, the placement of dots within a Braille character is regular. The space 
between adjacent characters in the same character line and between adjacent character 
lines is also regular (although small distortions and scanning skew may be present).
Therefore, one could construct a grid whose intersections determine the possible posi- 
tion of dots on the page image. For double-sided documents, there will be two grids 
for each image, one for the protrusions (recto dots) and one for the depressions (verso 
dots). This is not a new idea; it has been used before [5], but in a fixed way that does
not account for possible slight variations in character positioning in different lines. 

The system described here constructs a relatively flexible grid by allowing varia- 
tions in the positions of characters between different lines (i.e., the Braille characters 
need not be aligned in the vertical direction). The positions of all possible protrusions 
and depressions are calculated based on the dots identified so far. The grid-forming 
method is tolerant of wrongly recognised dots (in non-valid positions), of lines not 
containing any Braille characters and of characters with no dots in one of the two 
columns. The following process takes place for each of the lists of protrusions and 
depressions (separately). 

First, rows of dots are identified by grouping together dots (points) that have the 
same (within a small tolerance) y-coordinates. This simple process is successful in 
identifying rows of dots even in the presence of small amounts of skew. For larger 
amounts of skew, it is straightforward (and fast) to perform a Hough transform on the 
dots to identify the precise orientation of the rows and “rotate” them so that they are 
horizontal - this correction is performed by adjusting the coordinates of dots in each of 
the two lists (protrusions and depressions). 
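
A minimal Python sketch of the row-grouping step follows. The function name `group_rows`, the default tolerance of 2 pixels and the use of a running mean as the row position are illustrative assumptions; the Hough-based skew correction is not shown.

```python
def group_rows(dots, tol=2):
    """Group (x, y) dots into rows whose y-coordinates agree within `tol` pixels."""
    rows = []
    for x, y in sorted(dots, key=lambda d: d[1]):
        if rows and abs(y - rows[-1]["y"]) <= tol:
            row = rows[-1]
            row["dots"].append((x, y))
            # Keep the row position as the running mean of its members.
            row["y"] = sum(d[1] for d in row["dots"]) / len(row["dots"])
        else:
            rows.append({"y": float(y), "dots": [(x, y)]})
    return rows
```

Sorting by y first means that each dot only needs to be compared with the most recently opened row.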

Having identified the rows of Braille dots, a frequency histogram of the vertical 
distances between adjacent rows is calculated. The histogram should have two main 
peaks, one indicating the intra-character vertical distance between dots, and the other 
the inter-line (vertical) distance, i.e., the vertical distance between the bottom row of a 
Braille character line and the top row of the next. 

Given the two vertical distances, consecutive pairs of rows of dots are examined. If 
their distance is judged to be the intra-character one they are labeled as rows of a spe- 
cific character line. If, on the other hand, the vertical distance is judged to be the inter- 
line one the first one is labeled as the bottom row of one character line, and the second 
one as the top row of the next line. It should be mentioned that the labeling process 
keeps track of the position of each row within a character line and labels each row of 
dots accordingly as the first, second or third row of a specific line. 
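
The histogram and labeling steps can be sketched as follows. This is an illustrative sketch only: it assumes integer row positions, takes the smaller of the two most frequent gaps to be the intra-character distance, and uses a hypothetical tolerance `tol`; a real histogram would need binning to absorb small measurement differences.

```python
from collections import Counter

def two_main_distances(row_ys):
    """Return (intra_character, inter_line) as the two most frequent gaps between adjacent rows."""
    gaps = [b - a for a, b in zip(row_ys, row_ys[1:])]
    # Assumes at least two distinct gap values are present.
    (d1, _), (d2, _) = Counter(gaps).most_common(2)
    return min(d1, d2), max(d1, d2)

def label_rows(row_ys, intra, inter, tol=2):
    """Label each row as (line_index, position 1..3); None when undecided."""
    labels = [None] * len(row_ys)
    line = 0
    # Pass 1: an inter-line gap fixes a bottom row (3) and the next top row (1).
    for i in range(len(row_ys) - 1):
        if abs(row_ys[i + 1] - row_ys[i] - inter) <= tol:
            labels[i] = (line, 3)
            line += 1
            labels[i + 1] = (line, 1)
    # Pass 2: propagate labels across intra-character gaps, forwards then backwards.
    for i in range(len(row_ys) - 1):
        if labels[i] and not labels[i + 1] and abs(row_ys[i + 1] - row_ys[i] - intra) <= tol:
            ln, pos = labels[i]
            if pos < 3:
                labels[i + 1] = (ln, pos + 1)
    for i in range(len(row_ys) - 1, 0, -1):
        if labels[i] and not labels[i - 1] and abs(row_ys[i] - row_ys[i - 1] - intra) <= tol:
            ln, pos = labels[i]
            if pos > 1:
                labels[i - 1] = (ln, pos - 1)
    return labels
```

Rows that remain `None` correspond to the unlabeled rows discussed next in the text.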

Naturally, not all rows of dots will be labeled at this stage, as some Braille charac- 
ter lines happen not to have all three rows of dots present, therefore giving rise to 
larger vertical distances between their row(s) of dots and those of adjacent character 
lines. To identify the position of an unlabeled row within its character line (first, second
or third row), the vertical distances between each row of the adjacent character
lines and the unlabeled row are examined. If a distance is found to be a multiple of the 
inter-line one, the unlabeled row will receive the same label as the other row involved. 
For instance if the vertical distance between the second row of the character line above 
and the unlabeled row is a multiple of the inter-line distance, the unlabeled row will be 
identified as the second row within the current character line. 

Having identified all existing rows of characters, a similar process takes place to 
identify and label the columns of dots within Braille characters. In this case the fre- 
quency histogram of horizontal distances is used to identify intra-character and inter- 
character column distances. 

At the end of both the row and the column labeling processes, the Braille dot grid (a 
complete description of all possible dot locations in the image) is constructed. This 
construction process takes into account the “missing” dots in addition to those identi- 
fied from the white and black components. 

The resulting grid, therefore, is a list of all the possible valid dot occurrences, given 
the identified valid character lines. Since care has been taken in the process to allow 
for possible deformations, the grid is not a fixed global grid but it is flexible in taking 
into account local variations between lines of characters (i.e., independent column 
positions in each line). This flexibility is ensured by constructing a dot grid for each 
Braille character line and then joining the individual grids to form one for the whole
page. An example visualization of the grid for a given portion of a character line (with 
gridline intersections indicating possible dot locations in that character line) is shown 
in Fig. 4(d). 



2.5 Additional Braille Dot Recovery 

The initial Braille dot location method (Sec. 2.3) successfully identifies most dots (on 
average about 98%). One of the reasons for the missed dots is that the initial location 
process is cautious in situations where black and white regions have been merged 
across more than two dots (recto and verso). At that stage it was necessary to be cau- 
tious, as there usually exist considerable amounts of noise regions that could be inter- 
preted as parts of dots. An example of the situation where a series of dots from both 
sides of the document have merged regions is shown in Fig. 3(b) with the dots out- 
lined. 

Having the list of valid dot positions in the image (grid), the system can now verify 
the dots previously detected and, more importantly, attempt to recover dots that were 
previously missed. 

To identify valid missed dots, each of the vacant locations in each of the two grids 
(separately for protrusions and depressions) is examined. Similarly to the initial dot 
location process, the system searches for a valid black/white or white/black compo- 
nent arrangement (depending on which of the two grids is used). In this case, however, 
the size/distance constraints are more relaxed (although care is taken not to use black 
or white components corresponding to noise - typically very small components com- 
prising less than 3-4 pixels). 



Fig. 4. (a) Part of a scanned Braille document, (b) Result of pre-processing, (c) Result of initial 
dot location, (d) Illustration of possible valid dot positions (grid). 



2.6 Braille Character Recognition 

The two identified grids of protrusions and depressions (Sec. 2.4) contain information 
not only on the valid positions of dots on the whole page but also on the groupings of 
columns (within lines) that can form Braille characters. Having identified all possible 
valid dots, the system at this stage segments the Braille characters in the image (both 
sides of the page) and recognizes the corresponding codes. A code in this case is a 
unique identifier of each Braille character. This digital representation of a Braille 
character takes the form of a 6-bit word (or 8 bits in 8-point cell Braille), each bit 
position corresponding to the dot position in the Braille character. 

In each of the two grids, each grid point within a character is examined. When a dot 
is found at that point the bit corresponding to that position in the Braille character is 
switched on (‘1’), otherwise it is left off (‘0’). An illustration of the digital representa- 
tion of the Braille characters found in the example of Fig. 4(a) can be seen in Fig. 5. 
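
As an illustration of the 6-bit representation, the following Python sketch converts one character cell of the grid into a code. The bit order (dot k mapped to bit k-1, with dots 1-3 in the left column and 4-6 in the right) is an assumption; the paper does not state which ordering the system uses.

```python
def cell_code(cell):
    """cell: 3x2 grid of booleans (rows top to bottom, columns left to right)."""
    code = 0
    for col in range(2):
        for row in range(3):
            if cell[row][col]:
                dot = col * 3 + row + 1      # standard numbering: 1-3 left, 4-6 right
                code |= 1 << (dot - 1)       # assumed bit order: dot k -> bit k-1
    return code

# Dots 1 and 4 present (top row of the cell filled).
example = [[True, True], [False, False], [False, False]]
print(format(cell_code(example), "06b"))
```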



2.7 Braille Character Interpretation 

At this stage the Braille character codes are translated into their textual equivalent. In 
contrast to traditional OCR, this is not straightforward as there is significant ambiguity 
depending on the type of Braille used. As mentioned in Sec. 1, the same Braille char- 
acter will correspond to a single printed character in Grade 1 Braille, but may possibly 
correspond to a string of characters in Grade 2 Braille (assuming at least that the lan- 
guage of writing is known) or a non-character symbol in other types of Braille. 

Unfortunately, the type of Braille and the originating language are not indicated within
the Braille document. Even if this information is obtained from a different source (e.g., a
librarian), there is still no direct correspondence between Braille codes and printed
characters. In Grade 2 Braille the corresponding string of characters will depend on
the position of the Braille character within a line (relative to other Braille characters). 
Even in the simpler case of Grade 1 Braille, some Braille codes have dual meanings 
depending on context. For instance, a given code may represent a single letter unless it 
is preceded by a specific code, which signifies that the following codes should be 
interpreted as digits [1].

Fig. 5. Illustration of Braille codes and Grade 1 English translation of the characters in Fig. 4.

To demonstrate the functionality of this and further stages, the system currently 
supports Grade 1 English Braille. Further types of Braille can be supported by imple- 
menting suitable extensions. An illustration of the translation of Grade 1 English 
Braille codes to the corresponding printed characters is shown in the rightmost column 
in Fig. 5. 
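
The context dependence of Grade 1 codes can be illustrated with a small Python sketch restricted to the letters a-j and the number sign. The table layout, the function names and the rule that numeric mode ends at a space are simplifying assumptions for illustration, not a description of the system's translation tables.

```python
def dots(*numbers):
    """A Braille cell represented as the set of raised dot numbers (1..6)."""
    return frozenset(numbers)

LETTERS = {
    dots(1): "a", dots(1, 2): "b", dots(1, 4): "c", dots(1, 4, 5): "d",
    dots(1, 5): "e", dots(1, 2, 4): "f", dots(1, 2, 4, 5): "g",
    dots(1, 2, 5): "h", dots(2, 4): "i", dots(2, 4, 5): "j",
}
DIGITS = dict(zip("abcdefghij", "1234567890"))
NUMBER_SIGN = dots(3, 4, 5, 6)
SPACE = dots()

def translate(cells):
    """Translate cells to text; cells after a number sign are read as digits."""
    out, numeric = [], False
    for cell in cells:
        if cell == NUMBER_SIGN:
            numeric = True
        elif cell == SPACE:
            numeric = False               # simplification: a space ends numeric mode
            out.append(" ")
        else:
            letter = LETTERS.get(cell, "?")
            out.append(DIGITS.get(letter, letter) if numeric else letter)
    return "".join(out)
```

The same two cells, dots(1) and dots(1, 2), translate to "ab" in letter mode and to "12" when preceded by the number sign.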

2.8 Character Validation and Correction 

A unique characteristic of the system that further enhances its robustness is its provi- 
sion for post-processing to recover from any recognition errors as well as from possi- 
ble errors during the production of the Braille document (mistyping errors). Here, 
knowledge of Braille is used as well as of the language of the writing. This can be 
thought of as the equivalent of prioritizing alternative recognition results depending on 
a ranking of the most common spelling mistakes in conjunction with existing character 
features. 

In contrast to printed character recognition (where there are distinct structural differences
and similarities between characters), the nature of Braille characters does not
provide equivalently suitable clues. A given Braille character can be easily converted
to another (similarly valid) one by the presence or absence of one or more dots
at the pre-specified locations. It is therefore not possible to identify a suitable alternative
to a “wrong” Braille character simply by examining its form (code).

On the other hand, examining only the translated text does not help in correctly
ranking the candidate replacements for wrong words. This is because, given a
suspect Braille character, not every printed alternative can actually be derived from it.
Instead, only those printed characters that can result from a (limited) incremental
addition of missed, or deletion of detected, dots of the given Braille character
should be considered.

The system works at the word-level taking into account the structure of the underly- 
ing Braille characters (that give rise to each printed word in consideration). First, the 
Braille codes of the entire document are translated into the appropriate printed textual 
representation (here this is English, as explained in the previous section). The resulting 
text is then split into words. Each of these words is then looked up in a dictionary. 
Words that are found in the dictionary are left unchanged. Those that are not found in 
the dictionary are subjected to two stages of transformation in order to identify a suit- 
able replacement. 

In the first stage, the system searches (in the English dictionary) for words that are 
similar to the detected word but with one dot altered in the constituent Braille codes. 
To do this, n (where n is the number of dots in the word) copies of the word (as cur- 
rently expressed in Braille codes) are created and in each one, a different dot is re- 
versed from its current state (i.e., an existing dot is removed or an additional dot is 
introduced). These words are then evaluated. If a match is found in the dictionary for 
just one of the new words, the original incorrect word is replaced with that. If more 
than one match is found in the dictionary, they are presented as alternative options. 

If no suitable matches are found in the first stage, the post-processing method proceeds
to a second stage where, in a similar manner to the previous stage, n(n - 1)
copies of each incorrect word are created. Each alternative word is derived from the
original by reversing the state of two dots in the underlying Braille codes. These words
(the translation from the Braille codes) are then looked up in the dictionary and, depending
on the number of correct matches, the original incorrect word is either directly
replaced or (if more than one match is present) alternative options are listed.
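
The two-stage search can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the five-letter code table, the bit order (dot k mapped to bit k-1) and the treatment of n as the number of dot positions in the word (6 per character) are assumptions. The sketch enumerates unordered pairs of positions in the second stage, which covers the same candidates as n(n - 1) ordered pairs without duplicates.

```python
from itertools import combinations

def code(*dot_numbers):
    """6-bit code of a cell; assumed bit order: dot k -> bit k-1."""
    return sum(1 << (d - 1) for d in dot_numbers)

# Tiny illustrative table (letters a-e only).
CODE_TO_CHAR = {code(1): "a", code(1, 2): "b", code(1, 4): "c",
                code(1, 4, 5): "d", code(1, 5): "e"}

def translate(codes):
    return "".join(CODE_TO_CHAR.get(c, "?") for c in codes)

def flipped(codes, positions):
    """Copy of `codes` with the dots at the given word-level positions reversed."""
    out = list(codes)
    for p in positions:
        cell, bit = divmod(p, 6)
        out[cell] ^= 1 << bit
    return out

def correct(codes, dictionary):
    """Return the word itself if valid, else matches after one, then two, dot reversals."""
    word = translate(codes)
    if word in dictionary:
        return [word]
    n = 6 * len(codes)
    for k in (1, 2):
        candidates = {translate(flipped(codes, pos)) for pos in combinations(range(n), k)}
        matches = sorted(candidates & dictionary)
        if matches:
            return matches            # one match: replacement; several: alternatives
    return []
```

A single returned word corresponds to a direct replacement, while several returned words correspond to the alternative options presented by the system.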



3 Results and Concluding Remarks 

The system has been tested with a wide variety of scanned Braille documents, both 
single and double sided, written in different languages and Braille types (including 
music), produced with different mechanical methods and scanned with different scan- 
ners. Due to the sheer variety of documents, it has not been possible to obtain textual 
ground truth. For our tests, ground truth was created manually by counting the number
of Braille dots and characters in each document (both sides of a page if characters are 
present on the verso). As such, the general recognition results given here refer to the 
detection of dots and characters. 

Due to the nature of Braille documents, most documents are expected to have
some defects such as slight variations in background colour and a small number of
dark spots (one type of paper imperfection). Such documents are referred to as being
of average quality. Documents of progressively worse quality (referred to as being of
low and very low quality) may have a larger number of dark
spots, excessive paper grain and large variations in background colour. In addition,
image quality depends on the extent of scanning artefacts, such as the effect of non- 
uniform illumination. These problems generally cause the lightness of the highlight 
part of Braille dots to be much less distinct from the page background and spurious 
regions to appear in the image that could be misidentified as the highlight or shadow 
parts of dots. 

Overall, on single-sided documents with average defects 100% of the dots are cor- 
rectly recognized. This rate drops slightly to 98.2% and 97.6% for low quality and 
very low quality documents, respectively. In terms of character recognition, on the 
same documents, the system achieves 99.9% for documents with average defects, 
97.4% for low and 94.9% for very low quality documents. 

On double-sided documents the overall dot detection performance of the system is 
at or slightly below 99% for both average and low quality (exhibiting excessive noise 
and detrimental illumination defects) documents. In terms of the character recognition 
performance, the system achieves 98.7% for average quality documents, 98% for low 
and 94% for very low quality documents. 

Typical failure cases in the single-sided documents result from defects present in 
the image, such as worn (from prolonged reading) protrusions and paper defects 
(marks resembling valid dot components). In the case of double-sided documents, the 
additional problem of the merged dot components (white and/or black regions) is 
present as a potential cause of missed dots. 

To put these results into perspective, the documents scanned by the authors using a
standard scanner all fall within the average difficulty category, where the system
achieves near perfect results. The low and very low quality documents were obtained
already scanned, some about a decade ago, using quite old scanner technology and, as
such, the authors are inclined not to consider them relevant to the present and future
performance of the system.

Moreover, the above results do not take into account (for reasons of clarity and to 
demonstrate the broad applicability of the system) the post-processing stage. While 
there are clear indications of performance improvement by including that stage, the 
dataset currently used does not lend itself to a representative and accurate quantification
of its performance (as explained in Sec. 2.8, the post-processing stage is not implemented
for all the different types of Braille and languages present in the dataset).

For comparison purposes, the dataset used by Mennens et al. [5] was obtained. This
was limited to 7 documents and included cases in all three categories of quality grade.
As Mennens et al. do not, however, specify for which images they obtain which specific
results, the discussion here is limited to overall performance. The system presented
here successfully recognized all documents with at least comparable results.
Moreover, it is the authors’ expert opinion that the approach of Mennens et al. would fail on
the very low quality documents, whereas the system described here achieves 94%
success. In terms of the performance of the system by Ritchings et al. [6], the current
system shows a significant improvement as a result of both the improved image
quality and the enhanced robustness aspects introduced.

Overall, the system described in this paper has distinct advantages over previous 
approaches while it maintains a cost effective and visually-impaired-user-friendly 
realization. Robustness to cope with low quality scans and defective documents is 
built-in at different levels, from the initial thresholding through to the flexible Braille 
point grid construction. More significantly, though, a word-level post-processing stage 
is introduced that takes into account knowledge of both the language of the corre- 
sponding printed text and the unique characteristics of Braille character structure. 

Further work is focused on eliminating some of the errors attributed to card (paper) 
defects and on expanding the translation capabilities to different types of Braille.

Abstract. Security concerns about document images are important since
document images are distributed in large amounts both electronically and
physically. They are easily copied and the copyright information is difficult
to identify. In this paper, we present a new algorithm for document
image watermarking, which is based on weight-invariant partition of the 
document in the spatial domain, where the weight of a partition is the 
average number of pixels in lines within the partition. We discuss the 
issues of robustness, security, capacity and fault-tolerance related to the 
watermarking method. In order to simultaneously achieve high capacity 
and security, the partition method is further improved using the support 
vector machine technique. The experiments indicate the soundness of our 
method. 



1 Introduction 

Security issues concerning digital data have attracted a lot of research attention recently.
In order to protect data against illegal access and unauthorized reproduction,
various kinds of methods based on encryption and watermarking have been proposed.
The encryption techniques make images invisible to unauthorized users
while watermarking methods make images visible to all users. Many methods
such as [10, 12, 16, 17, 19] have been proposed for general (i.e., gray-scale and
color) image watermarking.

In particular, security concerns about document images are important since
document images are distributed in large amounts both electronically and physically.
They are easily copied and the copyright information is difficult to identify.
However, most general image watermarking methods are based on “transform-domain”
techniques and are less useful for embedding watermarks into document
images because their modifications tend to be visible and are easily removed by
binarization [13]. Some methods specific to document image watermarking have
been proposed in the literature, such as [1, 3, 4, 14].

This paper presents a new algorithm for document image watermarking based
on weight-invariant partition. In the algorithm, we first partition the document
into several subdivisions. For each subdivision, we make minor modifications such
that some lines are suitable for embedding bits and the weight of the subdivi- 
sion (i.e., the average number of pixels in lines within the subdivision) remains 
unchanged after the embedding phase. Several important issues related to the 
method such as robustness, security, capacity and fault-tolerance are discussed in 
detail. In order to simultaneously achieve high capacity and security, the water- 
marking problem is further reduced to a classification problem and the support 
vector machine technique is applied to solve it. 

The rest of the paper is organized as follows: Section 2 describes the new 
method for document image watermarking based on weight-invariant partition. 
Section 3 describes an improved partition method using support vector machine. 
Section 4 shows the experimental results. A summary of work is given in Sec- 
tion 5. 



2 A New Algorithm for Document Image Watermarking 

2.1 Preliminaries 

Suppose we want to embed the binary sequence $\tau$ into a document. To partition
a document is to divide the whole document into pairwise disjoint parts. In
the following, we abuse the term “partition” a little: we also let partition mean
one of the disjoint parts in the above definition. Let $S(P)$ of cardinality $n$ ($|S(P)| = n$)
denote the set of all text lines in a partition $P$. The weight $w(l_i)$ of a text line
$l_i$ is the total number of pixels in $l_i$. The sum weight of $P$ is the sum of the weights
of all text lines in $S(P)$. The average weight $A(P)$ equals the sum weight of
$P$ divided by $n$. We also call $A(P)$ the weight of the partition $P$. Let $S_{in}$ of $P$
denote the set of text lines with weights within $A(P) \pm \delta$, $\delta$ being a positive
integer to be discussed later. Let $S_{out}$ be $S(P) - S_{in}$. The key observation of our
method is that $A(P)$ of a partition $P$ is not likely to be significantly changed
due to noise. Therefore, $A(P)$ is a partition modification invariant.
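
The definitions above can be made concrete with a short Python sketch. It works on a plain list of line weights rather than on an image; the function name `partition_stats` and the index-based representation of $S_{in}$ and $S_{out}$ are illustrative assumptions.

```python
def partition_stats(weights, delta):
    """Return (A(P), S_in, S_out) for a partition given as a list of line weights.

    S_in and S_out are returned as lists of line indices.
    """
    n = len(weights)
    avg = sum(weights) / n                                   # A(P)
    s_in = [i for i, w in enumerate(weights) if abs(w - avg) <= delta]
    in_set = set(s_in)
    s_out = [i for i in range(n) if i not in in_set]         # S(P) - S_in
    return avg, s_in, s_out

avg, s_in, s_out = partition_stats([100, 98, 103, 120, 79], delta=3)
```

In this toy partition the average weight is 100, so the first three lines fall in $S_{in}$ and the two outliers form $S_{out}$.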



2.2 Embedding and Extracting Bits in a Uniform Partition 

First assume that we have eliminated the text lines that are too short; however,
this is not necessary if we can partition the document in a nice way (refer to
Section 3). Informally, a partition $P$ is a uniform partition if half of the lines in
$S(P)$ have weights close to $A(P)$. From this, we expect that the median and
the mean of the weights of lines in $S(P)$ are roughly equal. Therefore, if $\delta$
is appropriately chosen, $|S_{in}|$ will not be too small and we can embed enough
bits into the partition as follows. We first modify the partition (by adding or
removing at most $\delta$ pixels from every line in $S_{in}$) so that all lines in $S_{in}$ have
weights of exactly $A + \delta$ or $A - \delta$. We call this the standardization process. Suppose
that we have added $r$ pixels in total (note that $r$ may be negative) through
the process. In order to maintain $A(P)$ of the partition, we must accordingly
remove $r$ pixels from lines in $S_{out}$, which is called flushing. We use a late flushing
strategy, that is, the flushing is delayed till the end of the embedding process.
After the standardization process, $\tau$ is sequentially embedded into each line in $S_{in}$.
We further modify the partition only when embedding 0: we add or remove
$\delta/2$ pixels to $l_i$ such that after modification, $l_i$ has a weight of exactly $A + \delta/2$
or $A - \delta/2$. To maintain $A(P)$, $r$ also needs to be accordingly updated during
this process.

We now show how to determine $\delta$. Since half of the lines have weights close
to $A(P)$, we first sort all lines in the partition in increasing order of their
weights and select the middle $n/2$ lines, starting with $l_i$ and ending with $l_j$. Set
$\delta = \min\{A - |l_i|, |l_j| - A\}$, where $|l|$ denotes the weight of line $l$. Then we have
$n/4$ lines for flushing for either sign of
$r$. On the other hand, due to the important fact that $\delta$ can be easily computed
by the extractor (see below), it need not be fixed and can be determined by various
strategies as long as the embedder has enough lines in $S_{out}$ for flushing
without noticeable modifications to the document. Therefore, by varying $\delta$, we
may select another number of lines (e.g., the middle $2n/3$ lines) to form $S_{in}$.

In the process of flushing, if $r < 0$, we add $|r|$ pixels to $S_{out}$ to maintain $A(P)$.
Since $n/4$ lines in $S_{out}$ have weights larger than $A + \delta$, we can uniformly flush
the pixels to these lines, i.e., $4|r|/n$ pixels are added to each of the $n/4$ lines. Clearly,
after flushing, the weights of these lines will become even larger than $A + \delta$. We
treat the case $r > 0$ similarly. Finally, note that we embed 01 before $\tau$ to
ensure that at least one 0 and one 1 are embedded, which is useful for the extractor.

The process of extracting watermarks from the embedded document is simple.
Since the weight of a partition is not changed by our embedding process, we
can simply search the embedded partition, compute $A(P)$, and then compute
$\delta$ as follows: find the line $l_i$ such that $|A - w(l_i)| = \min_k |A - w(l_k)|$; then
$\delta = 2|A - w(l_i)|$. With $\delta$, the extraction is straightforward. Since there is always
some noise, we can set up an error-tolerant constant $\epsilon$ for practical purposes,
i.e., we treat $\delta \approx \delta \pm \epsilon$.
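
The embedding and extraction procedure of this section can be sketched on a list of integer line weights. This is a simplified illustration under stated assumptions, not the authors' implementation: the average weight is assumed to be an integer, $\delta$ is assumed to be even, lines of $S_{in}$ left over after the payload are kept at $A \pm \delta$ (and therefore read as 1), and the actual pixel-level modification of characters is not modelled.

```python
def embed(weights, bits, delta):
    """Embed `bits` (prefixed by 0, 1) into a copy of `weights`, keeping the average fixed."""
    n, total = len(weights), sum(weights)
    if total % n or delta % 2:
        raise ValueError("sketch assumes an integer average weight and an even delta")
    avg = total // n
    out = list(weights)
    s_in = [i for i, w in enumerate(weights) if abs(w - avg) <= delta]
    payload = [0, 1] + list(bits)
    if len(payload) > len(s_in):
        raise ValueError("not enough lines in S_in")
    r = 0                                      # net number of pixels added so far
    for k, i in enumerate(s_in):
        sign = 1 if weights[i] >= avg else -1
        bit = payload[k] if k < len(payload) else 1
        target = avg + sign * (delta if bit else delta // 2)
        r += target - out[i]
        out[i] = target
    if r:                                      # late flushing into S_out
        if r > 0:
            sinks = [i for i, w in enumerate(weights) if w < avg - delta]
        else:
            sinks = [i for i, w in enumerate(weights) if w > avg + delta]
        if not sinks:
            raise ValueError("no lines available for flushing")
        share, extra = divmod(abs(r), len(sinks))
        step = -1 if r > 0 else 1              # move sink lines further away from A
        for k, i in enumerate(sinks):
            out[i] += step * (share + (1 if k < extra else 0))
    return out

def extract(weights, eps=0):
    """Recover (delta, bits) from embedded weights; the 0, 1 prefix is dropped."""
    avg = sum(weights) / len(weights)
    delta = 2 * min(abs(w - avg) for w in weights)
    bits = []
    for w in weights:
        d = abs(w - avg)
        if abs(d - delta) <= eps:
            bits.append(1)
        elif abs(d - delta / 2) <= eps:
            bits.append(0)
    return delta, bits[2:]
```

For example, embedding the bits (1, 0, 1) with $\delta = 4$ into the weights [100, 102, 99, 101, 102, 130, 70, 126, 70] leaves the sum of the weights at 900, and `extract` recovers both $\delta$ and the bits without being told $\delta$.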

The following principle is applied for adding (resp. removing) pixels to (resp.
from) a text line. We first try to add (resp. remove) pixels to (resp. from) large-weight
characters, and then uniformly modify the remaining characters in the
text line. For a single character, the principle is to modify its boundary. These
strategies usually lead to subtle modifications, as indicated by our experiments.



2.3 Related Issues 

A natural question now arises: how can we partition the whole document into
smaller uniform partitions? The simplest way is to first group a fixed number
(which may be one) of consecutive pages as a partition and then check for the
uniform condition. If the partition is not uniform, it is divided into smaller ones.
This process is repeated until certain partitions are uniform. The main assumption
of the above strategy is that a reasonable number of physically consecutive
lines can form a uniform partition; however, this is not always true. Even when
it is the case, partitions formed in this way are usually small (while there are
not many uniform partitions), which greatly limits the capacity for embedding
bits. Therefore, we propose an improved partition method in Section 3 that does not
rely on this assumption.

For robustness, we note that the average weight of a partition is unlikely to
be significantly changed due to noise, assuming that the partition itself is large
enough. In addition, we can scale up $\delta$ to achieve good robustness; however, it
cannot be raised too much, otherwise the modification would be noticeable. For
higher robustness, we resort to fault-tolerance techniques, i.e., we can
modify $\tau$ by using any error correcting code or simply the repetition code before
the embedding process.

In the following, we propose an improved method for embedding $\tau$ using
a repetition code. We only discuss the case where a bit is embedded twice in
a single partition. Assume that at most one copy of a bit can be damaged and
that the damaged bit is distinguishable from both 0 and 1. If originally $n/2$ bits can
be embedded, only $n/4$ bits can be embedded if the naive repetition code is
used. We describe an improvement by which $n/3$ bits can be embedded in
the best case. The new strategy is that 0 is embedded as 01, and 1 as 10. Suppose
$\tau = (1, 1, 1, 1, 1)$. The sequence is naively embedded as $(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)$,
while it is $(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)$ by the improved strategy. Clearly, the number
of pixels to be flushed is reduced from $r$ to $r/2$. Suppose originally $Cn/2$ pixels
are flushed to the largest-weight $n/4$ lines in $S_{out}$, which meets the flushing
limits. Naturally, the flushing limit of the largest-weight $n/6$ lines in $S_{out}$ is
$Cn/3$ pixels. Assuming that $S_{in}$ is formed by the middle $2n/3$ lines, we then
need to move $2Cn/3$ pixels from the middle $2n/3$ lines to the largest $n/6$ lines. By
our trick, $r$ pixels can be reduced to $r/2$ pixels in the best case; therefore, only
$Cn/3$ pixels need to be flushed to $n/6$ lines, which has already been shown to be possible.
Hence, in the best case, we can embed bits into $2n/3$ lines rather than $n/2$ lines
by the improved embedding strategy. Finally, since noise may destroy
consecutive lines, a bit and its complement are embedded as physically far apart as
possible, e.g., if we want to embed $(1, 1, 0, 1, 0)$ into a partition, we can embed
it as $(1, 1, 0, 1, 0, 0, 0, 1, 0, 1)$.
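
The complement-based repetition code, with the two copies placed far apart, can be sketched in a few lines of Python. The representation of a damaged bit as `None` and the function names are illustrative assumptions.

```python
def encode(tau):
    """Append the complement of the whole sequence, so each bit and its copy are far apart."""
    return list(tau) + [1 - b for b in tau]

def decode(received):
    """Recover tau from a received sequence in which None marks a damaged bit.

    Assumes at most one of the two copies of each bit is damaged.
    """
    m = len(received) // 2
    out = []
    for b, c in zip(received[:m], received[m:]):
        if b is not None:
            out.append(b)
        elif c is not None:
            out.append(1 - c)          # second half stores complements
        else:
            raise ValueError("both copies of a bit are damaged")
    return out
```

With this layout, `encode([1, 1, 0, 1, 0])` reproduces the sequence given in the text, and a run of damaged consecutive lines affects at most one copy of each bit as long as the run is shorter than the sequence itself.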

3 Partition Method Using Support Vector Machine 

If the document is not partitioned in a regular way, the watermark will be 
more secure, since one must know the partitions before extracting bits from the 
document. In addition, we expect that most lines in a partition are similar to 
each other in order to embed enough bits. However, this is not easy to achieve
in general if we always select consecutive lines. Therefore, we would like to
compute the partition composed of logically close lines, which are not necessarily 
physically close. For this, we reduce the partition problem to a classification 
problem as follows. We first select some featured lines from the document as 
training data, then use them to train a support vector machine. Finally, we use 
the support vector machine to classify all other lines. We begin with a brief 
overview of the support vector machine technique.

3.1 Brief Overview of Support Vector Machine

The support vector machine (SVM), originally introduced in [5, 20], is a relatively 
new supervised machine learning technique which has been successfully applied 
to various problems such as face identification, text categorization and database 
marketing (see, e.g., [2, 6, 8, 9, 18]). SVM is a systematic methodology with both 
theoretical guarantees and practical robustness for pattern classification and 
nonlinear regression. Basically, SVM constructs a linear classifier in a feature 
space for a given set of binary labelled data through computing the maximal 
margin hyperplane that correctly separates the largest fraction of data points 
while maximizing the margin between nearest data points called support vectors. 

Given a set of labelled training data $(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$, $x_i \in R^n$,
$y_i \in \{-1, +1\}$, an SVM constructs a linear classifier in a feature space $F$, which
is nonlinearly related to the input space via a map $\phi : R^n \to F$. The classifier is
identified as the hyperplane

$$w \cdot \phi(x) + b = 0$$

in $F$ such that

$$\frac{1}{2}\, w \cdot w + C \sum_i \xi_i$$

is minimized subject to

$$y_i(\phi(x_i) \cdot w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

for all $i$, where $C$ is a constant and the $\xi_i$ are positive slack variables introduced to
handle the non-separable case. Note that it is a quadratic programming problem
and its Lagrangian dual is

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j)$$

$$\text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C.$$

Then the decision function to classify a new data point is

$$f(x) = \mathrm{sgn}\Big(\sum_i \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + b\Big).$$

For computational efficiency, the mapping $\phi$ is often implicitly performed using
a kernel function $K$ defined as $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. A kernel function must
satisfy

$$\int K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$$

for any $g(x)$ (resp. $g(y)$) such that $\int g(x)^2 dx$ (resp. $\int g(y)^2 dy$) is finite. In our
problem, we adopt a popular kernel

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^p.$$
For an in-depth introduction to the support vector machine, we refer
interested readers to [5].
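
As a minimal illustration of the decision function and kernel described above, the following Python sketch evaluates the sign of the kernel expansion for a kernel of the form (x_i · x_j + 1)^p. The function names, the default degree p = 2, and the multipliers in the usage note are hypothetical values chosen for illustration only; they are not obtained by solving the dual problem and are not taken from the experiments reported here.

```python
def poly_kernel(x, z, p=2):
    """Kernel of the form (x . z + 1)^p; the degree p is an assumed parameter."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** p


def decide(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """Evaluate sgn(sum_i alpha_i * y_i * K(x_i, x) + b).

    The multipliers `alphas` and the offset `b` are assumed to come from
    an already solved dual problem; this sketch does not solve it.
    A score of exactly zero is mapped to +1 by convention here.
    """
    score = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if score >= 0 else -1
```

For instance, with the two hypothetical support vectors (1, 0) and (-1, 0), labels +1 and -1, multipliers 0.5 each and b = 0, the call `decide((2, 0), [(1, 0), (-1, 0)], [1, -1], [0.5, 0.5], 0.0)` evaluates 0.5·9 − 0.5·1 = 4 and therefore returns +1.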

Since the above technique can only compute a two-class classifier,
we extend it to a multiple-class classifier as follows. Given n data points,
we first train an SVM on them to obtain two classes, then recurse on each of these
two sets until we have n classes.

3.2 Partition Method Based on Support Vector Machine 

Since SVM always tries to find the clusters grouping the most similar data 
points, we would expect that most lines within a resulting partition have similar 
weights and therefore help achieve good capacity. Note that lines in a resulting 
partition need not lie close to each other physically in the original document. 
Training data for SVM may be a collection of any lines or part of lines in any 
place of the document. However, without carefully selected training data, it is 
possible that almost all test data are classified into very few classes while the 
majority of classes contain almost nothing. In order to “balancedly” classify the 
test data, we propose a heuristic for selecting training data as follows. The basic 
idea is that we compute some components from the graph corresponding to our 
line-selection problem and select from each component a representative node as 
a training example. Consider an undirected complete graph $G = (V, E)$ where
every line $l_i$ in the document becomes a node $v_i$ and the weight of the edge
linking $v_i, v_j$ is the absolute value of the difference between the weights of
the two corresponding lines $l_i, l_j$. We want to compute a new graph $G' = (V', E')$
where $V = V'$ such that there are exactly $k$ (to be specified by the user) components
in $G'$ and each component contains no cycle. Initially, $E'$ is empty and $G'$ has
$|V'|$ components. We always add the longest possible edge (if there is a tie, we
arbitrarily choose one) from $G$ to $G'$ that connects two disjoint components
in $G'$ to form a new component. The algorithm terminates when the number of
components reaches k. Then the training example for each component is selected 
as the node where the sum of weights of incident edges in the clique (induced 
from G) containing all nodes within that component is minimum. Each training 
example is assigned to a unique class, which will be used for training the SVM. 
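
The selection heuristic described above can be sketched as follows. This is only an illustrative reading of the text, not a definitive implementation: the function name `select_training_lines`, the use of a union-find structure, and the way ties are broken (by edge enumeration order) are assumptions. Edges are processed from longest to shortest, as stated in the text, and an edge is added only if it joins two different components, so no cycle is ever created.

```python
from itertools import combinations


def select_training_lines(weights, k):
    """Return (components, representatives) for line weights and a target k.

    Nodes are line indices; the edge weight is |weights[i] - weights[j]|.
    """
    n = len(weights)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of lines")

    parent = list(range(n))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Longest edges first, following the rule stated in the text.
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: abs(weights[e[0]] - weights[e[1]]),
                   reverse=True)
    count = n
    for u, v in edges:
        if count == k:
            break
        ru, rv = find(u), find(v)
        if ru != rv:          # joins two disjoint components, so no cycle
            parent[ru] = rv
            count -= 1

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    components = list(groups.values())

    def incident_sum(v, members):
        # Sum of edge weights from v to every node of its component (clique).
        return sum(abs(weights[v] - weights[u]) for u in members)

    representatives = [min(c, key=lambda v: incident_sum(v, c))
                       for c in components]
    return components, representatives
```

Each returned representative would then be given its own class label before training the SVM, as described in the text.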

4 Experimental Results 

We first carry out the algorithm to embed three sequences $t_1$ = “01011010”,
$t_2$ = “00001001”, $t_3$ = “11011001”, each into a single uniform partition.
Recall that we need to embed “01” before each $t$. The original partition and
the embedded partition (for $t_1$) are shown in Figure 1. The average line weight
and its standard deviation, and the average character weight and its standard
deviation, are shown in Table 1. One sees that after embedding the
watermarks, the average line weight and character weight change only
slightly, while their standard deviations change considerably more. However, we note
that the standard deviation does not grow excessively, owing to the principle for pixel
addition/removal and the late flushing strategy.
Table 1. Experiment in a single partition.

Measure    Original    Embed $t_1$    Embed $t_2$    Embed $t_3$
L.A.       4927        4933           4939           4914
S.D.       242663      290021         329087         361044
C.A.       137         137            137            136
S.D.       261         498            580            547



Table 2. Experiment in a real document.

Original Document        Embedded Document
L.A.      S.D.           L.A.      S.D.
2623      31932          2620      53587
3404      38627          3409      61091
2498      34910          2492      73840
2010      10354          2016      24119
1567      12538          1565      29573
1859      9952           1856      15960
2519      25906          2518      46938
3021      30854          3016      54829
2294      17891          2293      35082


We next apply the SVM-partition method to embed a 128-bit
sequence into an eighteen-page double-column paper with fault tolerance. We first
preprocess the document so that it contains only text for the experiment.
The training data for the SVM are selected in the way discussed in Section 3.2
(we select k = 9). The fault-tolerance strategy is the one discussed in Section 2.3.
The average line weight and the standard deviation of the original document and
the embedded document are summarized in Table 2. As before, the average line
weight remains nearly unchanged, while the standard deviation differs. Figure 2 shows which
partition each indexed line of the document is classified into (note that we index
lines in the left column followed by those in the right column on each page). From
Figure 2 and Table 2, one sees that within each partition, the lines are not always
physically close but their weights are close. Therefore, we can embed
a sufficient number of bits into the partitions.

5 Conclusions 

In this paper, we propose a partition-based algorithm for document image
watermarking. In the algorithm, we first partition the document into several
subdivisions using a support vector machine, which can simultaneously lead to high
capacity and security in our case. For each subdivision, we make minor
modifications such that some lines are suitable for embedding bits. The weight of each
subdivision remains unchanged after the embedding phase and is not likely to
be significantly changed by noise, which makes the watermarking method robust.
Our experiments indicate that the method is practical.

Fig. 1. The original partition (left) and the embedded partition (right). 




Fig. 2. Distribution of lines in partitions. 



If we partition the document excessively, e.g., so that every few words form a
subdivision, then the embedded information is fragile, i.e., even slight changes to the
embedded document may destroy the embedded bits. We therefore expect that
our method can be extended to image authentication, which has recently attracted a lot of
research attention (see, e.g., [7, 11, 15, 21, 22]).
Abstract. With the rapid popularization of digital imaging equipment, video
character recognition is becoming more and more important. Compared with
traditional scanned documents, characters in video documents usually suffer from
great degradation and are hard to recognize. Thus, a systematic study
of video degradation will be very useful for video OCR. In this paper, a video
degradation model is proposed to imitate the process of video character image
generation. The generated character images are used to make synthetic
dictionaries to improve the recognition performance of real degraded characters in
e-Learning videos. Experiments on 24317 e-Learning video characters demonstrate the
effectiveness of our method.



1 Introduction 

Recently, digital imaging equipment such as digital cameras, digital video, and video
mobile phones has become more and more popular. The demand for OCR of video
characters has also increased dramatically. For good-quality images, traditional OCR
generally performs well. But many video images suffer from great degradation caused
by low image resolution, low contrast, large viewing angle, compression, etc. These
degradations cause problems for OCR. A systematic study of character
degradation is greatly needed. Many degradation models have been proposed [1][2][3]
recently for scanned documents. In this paper, a video degradation model is presented
to imitate the process of video image generation. The synthetic video character
images can be used for performance evaluation and dictionary data augmentation. In our
experiment, the generated images are used to make degraded dictionaries to recognize
characters in e-Learning videos. Evaluation results show that these synthetic character
images are very helpful for real degraded character recognition.

2 Video Degradation Model 

Our video degradation model is based on perspective transformation and super
sampling. As shown in Fig. 1, a clear character image is rendered into the scene plane. The
scene plane is rotated around 3 axes and the character image is projected onto the image
plane. Using the knowledge of perspective transformation, a pixel in the image plane
is mapped into a quadrangle region in the scene plane. So the grayscale value of the
pixel is calculated as the mean grayscale value of that region in the scene plane.
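
The averaging step can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the full model: the scene plane is assumed to be unrotated, so that the footprint of each image-plane pixel is an axis-aligned rectangle rather than a general quadrangle, and the function name `supersample` and the list-of-rows image representation are hypothetical.

```python
def supersample(scene, out_h, out_w):
    """Shrink a grayscale scene image by averaging each output pixel's footprint.

    `scene` is a list of rows of grayscale values. With no rotation, output
    pixel (r, c) covers a rectangular block of scene pixels, and its value
    is the mean grayscale value of that block.
    """
    in_h, in_w = len(scene), len(scene[0])
    result = []
    for r in range(out_h):
        r0 = r * in_h // out_h
        r1 = max(r0 + 1, (r + 1) * in_h // out_h)   # at least one scene row
        row = []
        for c in range(out_w):
            c0 = c * in_w // out_w
            c1 = max(c0 + 1, (c + 1) * in_w // out_w)
            block = [scene[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result
```

Shrinking a sharp black-to-white edge to a single pixel yields an intermediate gray value, which is the blurring effect that super sampling is meant to imitate.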




3 Recognition for e-Learning Video Characters 

The generated characters can be used for performance evaluation and training data
augmentation. Currently, we focus our application on character recognition in
e-Learning videos because many characters in e-Learning videos are very small
and have different background colors. Standard OCR cannot achieve satisfactory recognition
performance on these data since its training data is obtained from clean scanned
character images. Our idea is to use degraded images generated from our model to make
degraded dictionaries that improve the recognition rate.

1) Model simplification 

For e-Learning videos, the rotation angle of the scene plane is usually not very large.
Also, the distance between the camera center and the scene plane is far larger than the focal
length. Thus, the influence of scene plane rotation on the character image is mainly a
change of aspect ratio. Since size normalization is a preprocessing step of OCR
that can counteract this influence, we set the rotation matrix to the identity matrix.

2) Contrast estimation 

A histogram of every text line image is obtained, and the grayscale values of the two
peaks in the histogram, which correspond to the background and the character strokes, are
located. The two values are used to adjust the grayscale of the generated character
image so that its contrast is the same as in the real cases.
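
A rough Python sketch of this step is given below. The text does not specify how the two peaks are located or how the adjustment is performed, so both choices here are assumptions: peaks are taken as the most frequent gray value on each side of a fixed split value, and the synthetic image is remapped linearly onto the two peak values. The names `two_peaks` and `match_contrast` are hypothetical.

```python
from collections import Counter


def two_peaks(pixels, split=128):
    """Return the most frequent gray value below and at/above `split`.

    Assumes a bimodal text line with one mode on each side of `split`.
    """
    hist = Counter(pixels)
    dark = [v for v in hist if v < split]
    bright = [v for v in hist if v >= split]
    if not dark or not bright:
        raise ValueError("histogram has no mode on one side of the split")
    return max(dark, key=hist.get), max(bright, key=hist.get)


def match_contrast(synthetic, dark_peak, bright_peak):
    """Linearly remap a synthetic image (flat list) onto the estimated peaks."""
    lo, hi = min(synthetic), max(synthetic)
    if lo == hi:
        return [bright_peak] * len(synthetic)
    scale = (bright_peak - dark_peak) / (hi - lo)
    return [round(dark_peak + (p - lo) * scale) for p in synthetic]
```

After the remapping, the darkest and brightest values of the synthetic character coincide with the two peaks estimated from the real text line, so both images span the same contrast range.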
3) Deal with blurring

In our model, we use super sampling to imitate the blurring effect caused by various
factors. The degree of blurring is controlled by the shrinking rate of the generated
character size relative to the original character size. We do not estimate the exact level of
blurring, but instead generate several patterns for every category with different
shrinking rates to make several dictionaries with different degradation levels.

4) Experiment and discussion 

The testing data contains 24,317 binary multi-font characters extracted from 22 real
lecture videos, including Japanese class 1 and partly class 2 Kanji characters, alphabetic
characters, digits, symbols, katakana and hiragana. The original training data come from scanned
character images with 4299 categories. The binary character images are normalized to
a size of 64*64. The feature extraction method is a contour chain code based method [4].
A simple NN classifier is used for classification.
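
To make the role of the dictionaries concrete, here is a small Python sketch of classification against several dictionaries at once. It assumes that "NN" denotes a nearest-neighbour classifier and that distances are squared Euclidean distances between feature vectors; neither detail is stated in the text, and the function name `classify` and the toy feature vectors are hypothetical.

```python
def classify(feature, dictionaries):
    """Return the label of the closest template over all given dictionaries.

    Each dictionary is a list of (label, template) pairs, where a template
    is a feature vector of the same length as `feature`.
    """
    best_label, best_dist = None, float("inf")
    for dictionary in dictionaries:
        for label, template in dictionary:
            dist = sum((a - b) ** 2 for a, b in zip(feature, template))
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label
```

With a toy original dictionary `[("A", (0.0, 0.0)), ("B", (10.0, 10.0))]` and a toy degraded dictionary `[("A", (4.0, 4.0)), ("B", (7.0, 7.0))]`, the feature `(5.2, 5.2)` is labelled "B" by the original dictionary alone but "A" once the degraded templates are pooled in, which is the kind of correction degraded dictionaries are intended to provide.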

The degraded dictionaries are made from degraded patterns with shrinking rates of 20%,
40%, 60%, 80% and 4 levels of contrast: 64, 128, 192, 255. So the total number of
degraded dictionaries is 16. The recognition rates obtained with different combinations of
dictionaries are recorded in Table 1. Dict1 is the original dictionary. Dict2 includes
4 degraded dictionaries from the 4 shrinking rates with contrast estimation. Dict3 includes
Dict2 plus the original dictionary. Dict4 includes all 16 degraded dictionaries. Dict5
includes Dict4 plus the original dictionary.
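
The dictionary combinations listed above can be enumerated with a few lines of Python, which also makes the counts explicit (4 × 4 = 16 degraded dictionaries, 17 with the original). The tuple representation and the function name `build_dictionary_sets` are illustrative assumptions; only the shrinking rates, contrast levels and set definitions come from the text.

```python
from itertools import product

SHRINK_RATES = (20, 40, 60, 80)        # percent, as listed in the text
CONTRAST_LEVELS = (64, 128, 192, 255)  # as listed in the text


def build_dictionary_sets():
    """Return the five dictionary combinations as lists of descriptors."""
    original = [("original", None, None)]
    # Dict2: one dictionary per shrinking rate, contrast estimated per line.
    estimated = [("degraded", s, "estimated") for s in SHRINK_RATES]
    # Dict4: every shrinking rate combined with every fixed contrast level.
    degraded = [("degraded", s, c)
                for s, c in product(SHRINK_RATES, CONTRAST_LEVELS)]
    return {
        "Dict1": original,
        "Dict2": estimated,
        "Dict3": estimated + original,
        "Dict4": degraded,
        "Dict5": degraded + original,
    }
```

Calling `build_dictionary_sets()` gives sets of size 1, 4, 5, 16 and 17 for Dict1 to Dict5 respectively.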



Table 1. Recognition rate using different dictionaries

              Dict1     Dict2     Dict3     Dict4     Dict5
Recog. rate   84.54%    90.43%    90.55%    91.49%    91.57%



There are several interesting points we learned in the experiments: 

1. The performance of Dict2 is inferior to that of Dict4. For every text line, we have
only 4 levels of contrast in the dictionaries, so we have to choose the dictionary
with the most similar contrast, and there is some estimation error. Moreover,
using all 17 dictionaries gives the best result, but it is also the slowest. If
computation cost is not a concern, many more patterns can be generated by using
more contrast levels and shrinking rates, and the recognition rate is expected to be
even higher.

2. The degradation model shows different results for Kanji and alphabet. Fig. 2 shows
the size distribution of all Kanji and alphabet characters in the testing set, and that of
the misrecognized Kanji and alphabet characters. For Kanji, the distribution changes
a lot for the error cases. The errors mainly occur at very small sizes (about 10 pixels),
since the structural information is heavily degraded at such a scale. But for alphabet,
the size distribution does not change much in the error cases. The main reasons for
the errors include missing strokes, such as ‘i’ degraded to ‘l’, and shape similarity, such as
‘s’ misrecognized as ‘5’. Our video degradation model seems more effective on
character sets with abundant structural information, like Chinese and Japanese. This
is also confirmed by the recognition rates: the recognition rate improves from 81.41%
to 96.55% for Kanji, but only from 75.70% to 84.64% for alphabet.

3. The degradation model does not work well for very small characters. At very low
image resolution, binarization loses too much information that is valuable for
recognition. Either an improvement in binarization or feature extraction directly from
the grayscale image is needed.




Fig. 2. Size distribution for Kanji and alphabet. 



4 Conclusion 

In this paper, a video degradation model is proposed to produce synthetic character
patterns with different degradation levels. The generated patterns are used to make
degraded dictionaries to improve the recognition performance of characters in
e-Learning videos. The advantages and shortcomings of the model are also discussed.
Abstract. In this paper we present the preliminary results and the
evaluation of a combined thematic segmentation of (a) meeting documents
and (b) meeting speech transcripts. Our approach is based on a clustering
method applied to a 2D representation of the thematic alignment, followed by
the projection of the extracted clusters onto each axis, corresponding
to the meeting documents and the speech transcript. Finally, our bi-modal
thematic segmentation method is evaluated against a mono-modal
segmentation method (TextTiling).



1 Introduction 

In the context of multimodal applications, especially meeting recordings and
lectures, research is under way to establish temporal links between the
various modalities, mainly between documents and meeting dialogs [5]. Our
viewpoint is that temporal links between these two modalities may be
obtained once their thematic links, i.e. their thematic alignment, are established.

The document/speech thematic alignment and the thematic segmentation
are closely related. Thematic alignment builds thematic links between
document and speech units that are semantically close, while thematic
segmentation builds thematic links between units of a single modality (document
or speech). Thematic segmentation is thus an intra-modal segmentation, while
thematic alignment is an inter-modal segmentation. Since the preliminary
evaluation we have performed on state-of-the-art thematic segmentation methods
did not show good results, our assumption is that an inter-modal segmentation
will be more efficient and will benefit from the information in the various modalities.
In this article, we briefly present our bi-modal thematic segmentation method
and its projection onto each modality. A preliminary evaluation shows that our
bi-modal segmentation is more efficient than a mono-modal segmentation.



2 Thematic Alignment vs. Thematic segmentation 

Our document/speech alignment takes as input the speech transcript of a
meeting and the documents related to the meeting, and generates a set of aligned
pairs (document units, speech units) [5]. Currently, we are focusing on press
reviews, where many speakers discuss a daily newspaper cover page.

Fig. 1. (a) Bi-graph representing the thematic alignment. (b) 2D representation of the
bi-graph. (c) Cluster projection.



The information contained in the documents, which are in PDF form, is first extracted
and then automatically converted into a multi-layered structure (layout, logical
and syntactical structures) [2]. The document logical structure is a hierarchical
decomposition of the document into a set of labels. The logical structure of our
newspaper cover page is a set of articles, where each article contains a title,
content, an author, etc. The syntactical structure, in contrast, is a segmentation
of the document into a set of textual components, e.g. sentences and paragraphs.
The speech, on the other hand, is currently transcribed manually, and is composed
of thematic episodes, which contain many speakers’ turns. Each turn contains
at least one utterance, which is a homogeneous speaker part.

Once the document and the speech transcript structures are acquired, a
matching process based on various similarity measures (Cosine, Dice, Jaccard) is
carried out between the various pairs (e.g. sentences with utterances, turns with
logical blocks, etc.) [4]. Thus, for a given unit from the source file (document or
speech), all the similar units in the target file are selected (figure 1.a).
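
For reference, the three similarity measures named above can be written in Python as follows. This is a generic sketch over tokenized units; the tokenization, any term weighting, and the thresholds used to decide that two units are "similar" are not specified in the text and are left out. Cosine is computed on term counts, while Dice and Jaccard are computed on term sets, which is one common convention among several.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def dice(a, b):
    """Dice coefficient between the term sets of two token lists."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def jaccard(a, b):
    """Jaccard coefficient between the term sets of two token lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

For the token lists `"the budget vote".split()` and `"the vote today".split()`, two of the four distinct terms are shared, so the Jaccard coefficient is 0.5 and the Dice coefficient is 2·2/6.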

This thematic alignment, which is a symmetrical relationship between the
document and speech units, can be represented by a 2-dimensional graph, where
each dimension represents a distinct modality (figure 1.b). Each node in this
representation is a relationship between a document unit and a speech unit (e.g.
utterance 79 with sentence 8 has a similarity value of 0.57), and the node size
represents their similarity value. Starting from this 2D representation, a
clustering process based on an improved K-means method has been applied in order to
bring to light the denser regions, which we believe may represent the various
meeting topics. This clustering method was enriched by a step that filters out
low-density clusters, by considering the cluster sizes, the node weights and
the distances (e.g. Euclidean distance) from the cluster centroids.

Once the denser regions are computed, they are projected onto each axis in
order to highlight the mono-modal thematic segments. In figure 1.c the cluster
A corresponds to a document segment $D_a$ and a speech segment $S_a$.
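
The projection step can be sketched in Python as below, under the assumption that projecting a cluster onto an axis means taking the range of unit indices its nodes cover on that axis. The function name `project_cluster` and the representation of a node as a (document unit index, speech unit index) pair are hypothetical; the clustering itself is not reproduced here.

```python
def project_cluster(nodes):
    """Project one cluster of (document_unit, speech_unit) nodes onto each axis.

    Returns ((first_doc, last_doc), (first_speech, last_speech)), i.e. the
    document segment and the speech segment spanned by the cluster.
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("a cluster must contain at least one node")
    docs = [d for d, _ in nodes]
    speech = [s for _, s in nodes]
    return (min(docs), max(docs)), (min(speech), max(speech))
```

For example, a cluster made of the nodes (8, 79), (9, 81) and (7, 80) projects to the document segment covering units 7 to 9 and the speech segment covering units 79 to 81.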
Table 1. Documents/Speech thematic segmentation evaluation.

Table 2. Pk evaluation of the bi-modal method, compared to a mono-modal method.

2.1 Experimental Results

Several metrics have been used in the evaluation of our method with respect to a
manually prepared ground truth: the entropy/purity and the Pk (Beeferman)
metric [1]. The entropy measures the disorder of segments. The
purity, on the other hand, measures the fraction of generated segments that do not contain
incorrectly placed objects. For a perfect segmentation, the respective values of the
two metrics are 0 and 1. The Pk metric measures the probability that a randomly
chosen pair of units at a distance of k units apart is inconsistently classified,
with a value of 0 for a perfect segmentation. This metric is more adequate than
a simple recall/precision, which measures only boundary detection. For this
experiment, the k parameter has been fixed to 4 units, which corresponds to the
minimum size of a relevant thematic segment.
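
A compact Python sketch of the Pk computation is given below. It assumes that each segmentation is supplied as a list holding, for every unit in order, the identifier of the segment it belongs to, with distinct identifiers for distinct segments; the function name `pk` and this input format are assumptions made for illustration.

```python
def pk(reference, hypothesis, k):
    """Fraction of unit pairs (i, i + k) on which the segmentations disagree.

    A pair is inconsistently classified when one segmentation places the two
    units in the same segment and the other places them in different ones.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("both segmentations must cover the same units")
    n = len(reference)
    if not 0 < k < n:
        raise ValueError("k must be positive and smaller than the unit count")
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        if same_ref != same_hyp:
            errors += 1
    return errors / (n - k)
```

With a reference of two four-unit segments, `[0, 0, 0, 0, 1, 1, 1, 1]`, an identical hypothesis gives 0, while a hypothesis that misses the boundary entirely gives 1.0 for k = 4, since every pair four units apart straddles the missed boundary.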

Table 1 shows the evaluation of the thematic segmentation of 8 meeting
documents and speech transcripts. The generated entropy/purity values depend
on the type of the meeting. Thus we distinguish two types:

1. If the speakers do not follow the linearized reading order of the documents, then the
temporal indexes of the document segments are not adjacent. This reduces
the number of overlapped segments and, as a result, gives satisfactory
values for the entropy/purity (e.g. documents $D_1$, $D_2$, $D_3$ and $D_4$).

2. If the meeting is non-stereotyped, i.e. with numerous debates, then there are
fewer overlapped segments (e.g. $S_6$, $S_7$ and $S_8$). This is due to the fact that
the speech segments are well separated from one another. As a result,
their entropy/purity values are better than those of stereotyped meetings.

The Pk evaluation is generally satisfactory (see Table 2), especially in
comparison to the TextTiling [3] method. Our bi-modal segmentation method is more
accurate in detecting the exact number of thematic segments, which is not the
case for the TextTiling method, which generates many extra segments.
2.2 Remarks

During the segment extraction process, overlapping problems often occurred.
This kind of problem happens when a unit is assigned to several segments, and
it mainly appears in stereotyped meetings. The relationship between the
overlapped segments can be one of two types: either one of them contains the other
(e.g. in figure 1.c, $S_d$ contains $S_c$), or they are partially overlapped (e.g. $D_b$
with $D_c$). Our work on resolving this problem is in progress, and is based
on a Gaussian probabilistic function. First, an overlapping coefficient is
computed. Depending on this coefficient, the corresponding segments are merged or
considered as two distinct segments, using the Gaussian probabilistic function.
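
The two overlap types can be distinguished with a short Python sketch. Segments are assumed to be inclusive ranges of unit indices, and the coefficient shown (shared units divided by the length of the shorter segment) is only one possible definition, chosen for illustration; the text does not define the overlapping coefficient, and the Gaussian-based merging decision is not reproduced. The names `relation` and `overlap_coefficient` are hypothetical.

```python
def relation(a, b):
    """Classify two inclusive unit ranges as disjoint, containment or partial."""
    (a0, a1), (b0, b1) = a, b
    if a1 < b0 or b1 < a0:
        return "disjoint"
    if (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1):
        return "containment"
    return "partial"


def overlap_coefficient(a, b):
    """Shared units divided by the length of the shorter segment (assumed form)."""
    (a0, a1), (b0, b1) = a, b
    shared = min(a1, b1) - max(a0, b0) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a1 - a0 + 1, b1 - b0 + 1)
    return shared / shorter
```

Under this definition a contained segment always has coefficient 1.0, while partially overlapped segments such as (0, 4) and (3, 7) share two of five units and get 0.4.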

Further work is planned in order to improve this bi-modal thematic
segmentation, such as the integration of the node weights in the clustering method,
both while computing the cluster centroids and while assigning the nodes to the
clusters.

3 Conclusion and Future Work 

This paper shows the results of the evaluation of a bi-modal thematic
segmentation method, based on a preliminary thematic alignment of meeting documents
with speech transcripts. The comparison of this method with a mono-modal
method, i.e. the TextTiling method, shows promising results, despite the overlapping
problem that affects the segmentation and should be resolved. The segmentation
quality can be improved by considering the node weights earlier in the
clustering process. Other prospects are foreseen, such as the combination with
other alignments, for instance of speech turns with the documents' logical units,
or references to documents in meeting dialogs, citations, etc. In the long term, we
plan to integrate all the various types of alignments in a single framework.

This preliminary evaluation makes us believe that coupling modalities, in
this case meeting documents and speech transcripts, should considerably improve
the segmentation of each modality involved.